PREFACE


THE present century has seen extensive developments in methods of 
measuring scientifically the traits and abilities of human beings. 
Three main influences may be discovered behind this movement. 
First, there is the tremendous interest which people have always 
shown in judging one another's qualities. Much of their conversation 
consists in summing up their acquaintances. Their plans, either for 
work or play, are very largely guided by their notions regarding the 
persons with whom they are about to deal. It is obvious that such 
notions are frequently biased or inaccurate, and yet each one con- 
tinues to rely, in great affairs or small, chiefly on his own insight into 
the characters of others. However, it is recognized that in certain 
spheres of human activity, more accurate and impartial assessments 
are essential. The examination system, now firmly established in
schools, universities, and in many professions, is supposed to meas- 
ure the achievements and capacities of pupils, of students, and of 
candidates for various jobs. Although this system is certainly an 
advance on the earlier system of personal judgments, yet much 
doubt exists as to its efficiency. A considerable section of this book 
is therefore devoted to a discussion of the defects of examinations, 
and of present-day methods of marking pupils’ or students’ work, 
and to a study of how these may be improved. 

Secondly, mankind is interested, not only in individuals, but also 
in general problems of human nature and human institutions. Thus
another field where accurate measures of human qualities are needed 
is the scientific investigation of mind, behaviour, and society. Loose 
thinking and prejudice are extremely prevalent in matters of politics, 
economics, religion, crime, and the like—as is shown, for example, 
by Thouless in his book Straight and Crooked Thinking (1930). Yet 
the social sciences, such as psychology and sociology, are endeavour- 
ing to study these subjects rationally and impartially, in the same 
manner as the physicist studies the atom, or the physiologist the 
workings of the body. Now the data from which a science is built 
up are very largely numerical and quantitative. The records of a 
chemical or physical experiment usually consist of measures of 
weights, times, distances, galvanometer or thermometer readings. 
Measurement is almost equally important for purposes of precise
description and experiment in the social sciences; and the deduction
of sound conclusions from such quantitative data frequently involves 
the application of statistical treatment. In this connection, then, we 
shall consider below the nature of measurement in psychology, and 
the elements of statistical technique, with especial reference to the 
field of education. 

A third source of the mental testing movement lies in the desire 
for human betterment. Medical science has advanced enormously in 
its ability to diagnose and treat bodily ills, but psychiatry and psy- 
chology are only beginning to learn how to handle mental abnor- 
malities. Yet there are already, in addition to the mental hospitals, 
numerous clinics where the diagnosis and treatment of emotionally 
maladjusted patients—adults and children—are carried out. Special 
schools, and tutorial classes in some ordinary schools, play their 
part by providing a type of education suited to children who are too 
greatly handicapped, mentally or physically, to make progress under 
ordinary school conditions. Industrial schools and the Borstal system 
aim at re-educating delinquents and budding criminals, so as to turn 
them into useful members of society. The staffs of all these institu- 
tions need accurate methods of measuring the educational and 
occupational capacities of the ‘patients.’ Actually many of our most 
valuable mental tests were first constructed by workers in the fields 
of educational, emotional, and vocational guidance. 

This book is thus intended primarily for three classes of students, 
namely teachers and examiners, psychologists and others who set 
out to investigate human phenomena from the scientific standpoint, 
and doctors and psychologists who wish to make practical use of 
tests in clinics and schools. For reasons of space, no attempt is made 
to describe the various mental tests in detail, but a classified list of 
those used in this country is included in Chap. IX. Many authors 
have emphasized the value of tests without sufficiently drawing 
attention to their dangers and difficulties. Thus the main object of 
this book is to outline the principles of test construction and applica- 
tion or procedure, and the interpretation of the results of tests and 
examinations, ignorance of which on the part of testers has fre- 
quently led to serious errors. Mental measurement undoubtedly needs 
a high degree of skill, which few of the persons just mentioned have 
either the time or the opportunity to acquire thoroughly, before they 
set out to test or examine. 

There are also available many excellent textbooks on statistical
methods, perhaps the most useful to testers being those of Garrett 
(1953), Dawson (1933), and Guilford (1936). Yet in the present 
writer's experience, all of these are too difficult for the majority of 
students whose psychological training only amounts to between 
seventy-five and one hundred and fifty hours. This book aims, 
therefore, to describe the minimum essentials, to show their imme- 
diate application to psychological (and especially to educational) 
problems, and to stress their general significance rather than their 
technical details. 

More advanced students are unlikely to find within much that is 
original. But, so far as the writer knows, no previous account of the 
statistics of marking, nor of the construction of new-type examina- 
tions, has been published by a British author. The analysis of educa- 
tional measurement and the discussion of the technique of examining 
are partly new but owe a great deal to Hamilton's brilliant book, 
The Art of Interrogation (1929). Possibly also some of the practical 
hints about tests and testing have not been published before. 

Lack of space has necessitated some important omissions, such as 
the whole field of temperament and character assessment, the testing 
of special (vocational) aptitudes, and of sensory-motor capacities. 
Actually few of the tests falling under these headings are well
standardized, and fewer still can safely be applied by those who are
not trained psychologists.

P. E. V.


PREFACE TO THE SECOND EDITION


THE general principles of mental testing and elementary statistical 
analysis have changed but little in the sixteen years since this book 
was written. But the educational context in which they were origin- 
ally set has altered considerably, notably with the passing of the 1944 
Education Act. This edition, then, brings the book up to date in 
respect of selection for secondary education, though it is still, of
course, concerned only with the critical study of methods—not with 
the selection problem as a whole. Next, there have been important 
developments in the field of intelligence theory and intelligence 
testing, and it is hoped that the views put forward here represent a 
more reasonable picture of the relations of innate and acquired 
abilities and of the findings of factorial analysis than has hitherto 
been available to education and psychology students. The compre- 
hensive list of published mental tests has been completely recon- 
structed. Several of the statistical sections have been revised and 
slightly expanded to cover more of the problems which the examiner, 
mental tester, or research worker is likely to meet, while avoiding
as far as possible any increase in complexity. Thus in general the 
book covers the same ground as the original edition, at much the 
same length, but in a more up-to-date and more accurate manner. 
P. E. V. 


INSTITUTE OF EDUCATION 
UNIVERSITY OF LONDON 
May 1955 


CHAPTER I


INTRODUCTION TO STATISTICAL METHODS 


IT is an unfortunate, yet indubitable, fact that the majority of
students of psychology and education find extreme difficulty in 
grasping and applying the statistical concepts which are an essential 
prerequisite to sound mental measurement. They take fright at the 
mere mention of statistics or at the sight of statistical symbols and 
formulae. A small minority, on the other hand, find in the
manipulations of numbers which we call statistics an intense fascination,
closely akin to the musical composer's fascination with interweavings 
of sounds. Probably statistical ability is as specialized and as rare as 
is ability in musical composition, and the expert statistician rather 
easily loses touch with the doubts and difficulties of the beginner. 
He fails to realize that the statistical way of looking at educational 
or psychological phenomena, which comes so naturally to him, is 
at first entirely foreign to the average student. Yet the student's 
‘phobia’ is certainly unnecessary, since the elementary statistics
needed for the proper treatment of marks and test scores includes 
little more than the kind of algebra and arithmetic which he mastered 
by the age of 14. Statistics means nothing more than numerical data, 
such as measurements, examination and test results; and statistical 
methods are merely the ways of dealing with, or organizing, these 
data so as to bring out their full significance. 

The raison d'être of statistical methods is that human beings are
apt to differ so widely from one another in any and every measurable
characteristic, that results or conclusions which apply to one person
or one group of persons (e.g. one school class), do not necessarily
apply to another person or group. First, then, we must have efficient
techniques of arranging or tabulating the scores and marks obtained
from different people, in order to see what is the general run of
their scores, and what is the range or extent of the differences.
These techniques are described in the present chapter. The ‘general
run’ and the ‘range or extent’ of scores next require more precise
definition and measurement; that is to say, we must know how to work
out averages and indices of what is called spread, scatter,
dispersion, or variability (Chap. III).

Not only do human beings differ from one another, but also each
individual varies in his capacities and characteristics from time to
time. The score he achieves to-day may or may not be typical of his 
usual accomplishments. The same holds true of a class of pupils or 
students. The extent of this untrustworthiness or unreliability (as it 
is generally termed) of a single measurement or of a set of measure- 
ments therefore needs to be determined. Statistical treatment will tell 
us whether any alteration or variation in scores represents a real and 
important change—such a change, for example, as we might hope to 
bring about by means of the teaching given to the individual or to 
the class—or whether it is merely a matter of chance fluctuations. 
A further very common and important problem is the reality or
significance of a difference between two groups of people. For 
instance, girls may appear to be better than boys at reading; but 
this is not true of all girls and all boys; so we must have a method of 
establishing how general and how reliable is the difference (Chap. V). 

Still another type of variation requires special techniques of 
investigation, namely, the variations in people's capacities in differ- 
ent spheres. It is obvious that no one is even all-round in his level 
of abilities. Yet we are apt to assume that high achievement in one 
field implies high achievement in other fields. Sometimes also we infer 
the opposite type of relationship—that if a person is strong in some 
trait, then he will be lacking in some other trait. The statistical 
methods of correlation enable us to say how far various traits or 
abilities ‘go together’ or are related to one another. They also tell us 
how far our tests or examinations measure what they are supposed 
to measure, i.e. how valid they are. For example, a Certificate or 
Matriculation examination at the age of 16-18 may, or may not, 
validly predict success in a subsequent university career (Chaps.
VI-VIII).

Although statistical methods are, then, of undoubted importance, 
their limitations should also be recognized at this stage in our 
discussion. They are tools for handling numerical data, but they are 
not capable of improving data which may be initially inaccurate or 
biased. When an investigator uses a poor test or examination, or 
when he observes and records people's behaviour or mental processes 
wrongly, no amount of statistical treatment of his results can correct 
such deficiencies. Further, the fundamentally abstract nature of statis- 
tical methods needs to be stressed. They are adapted primarily for 
mass-investigations of measurements obtained from large numbers 
of persons, and such investigations are apt to lose touch with the
concrete individual case. Without them we could not make sound 
generalizations about human beings, but even with them we cannot 
usually state whether a generalization applies to a particular human 
being. The practical handling and understanding of an actual child 
or adult is still an art rather than a science, though it may be greatly 
assisted by scientific mental tests and guided by the results of objec- 
tive psychological experiments and statistical studies. 


TABULATION OF MEASURES 

Measures and Variables.—By a variable we shall mean an ability 
or some other characteristic in respect of which human beings vary 
or differ from one another. Height, age, income, intelligence,
examination success, are all variables in this sense. A measure is the 
numerical expression of a certain standing on, or amount of, a 
variable. Fifty inches of height, five pounds a week income, and a 
mark of 6 out of 10 are examples of measures. 

Most variables are continuous; that is to say, they can vary by 
infinitely small gradations. But we only measure them to the nearest 
convenient unit. For example, heights may be measured to the nearest
tenth of an inch. But actually 50.1 inches includes all heights from
50.05 to 50.15; 50.2 inches includes from 50.15 to 50.25, and so on.
Similarly, a school mark of 6 is usually an approximation representing
all degrees of quality from 5½ to 6½. If, however, half-marks are
awarded, then 6 represents 5¾ to 6¼, and 6½ represents 6¼ to 6¾, and
so on. But some variables are discrete or discontinuous, since they
only vary by units. For instance, if the size of a class of pupils is 40,
this does not imply between 39½ and 40½ pupils. The scores for many
tests, and some school marks, are of this type, for they represent the 
exact numbers of questions or test items correctly answered. Thus a 
pupil would score 6 out of 10 for arithmetic sums even if he had 
nearly completed a seventh sum. (Such measures are referred to in 
the next chapter as count or enumeration scores.) 
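
The convention for continuous measures can be sketched in a few lines of Python. This is only an illustration: the function name `true_limits` and the use of exact fractions are assumptions introduced here, not part of the original text. It takes a recorded measure and the unit to which it was measured, and returns the limits of the quantity that the measure stands for.

```python
from fractions import Fraction


def true_limits(measure, unit):
    """Return the (lower, upper) limits represented by a recorded measure.

    `measure` is the recorded value; `unit` is the unit to which it was
    measured (a tenth of an inch, a whole mark, a half-mark, ...).
    Both names are illustrative assumptions.
    """
    measure = Fraction(measure)
    half_unit = Fraction(unit) / 2
    return measure - half_unit, measure + half_unit


# A height recorded to the nearest tenth of an inch.
low, high = true_limits("50.1", "0.1")
print(float(low), float(high))            # 50.05 50.15

# A school mark of 6, awarded first in whole marks, then in half-marks.
print(true_limits(6, 1))                  # (Fraction(11, 2), Fraction(13, 2))
print(true_limits(6, Fraction(1, 2)))     # (Fraction(23, 4), Fraction(25, 4))
```

A discrete variable, such as the size of a class of pupils, has no such limits, so the function should not be applied to count or enumeration scores.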

This distinction between continuous and discrete variables needs 
to be borne in mind when tabulating measures (cf. p. 5). 

Tabulation.—Tabulation means arranging a set of measures in an 
orderly fashion, so as to convey the essential facts about them in 
convenient and relatively brief form. If we take a class of pupils, and 
beside each pupil’s name write his age, or exam. mark, we get a list, 
but it is not a table, since the measures are quite haphazardly
arranged. It is very difficult to see from such a list, if it be a long one,
what is the general run and range of measures, or whether any 
particular measure (e.g. John Smith's 6 out of 10) is high, low, or 
average. 

Ranking.—One way of tabulating is to rearrange all the measures 
in order of size from highest to lowest; that is, to put them in rank
order. The desired information can then be grasped much more 
readily. But this is a laborious process if the number of pupils is 
large. 

Frequency Distributions.—Under such circumstances it is better to 
write in a column each possible measure, in order, and beside it the 
frequency, F; that is, the number of persons who obtain each 
measure. The result is called a frequency distribution, since it shows 
how the various measures are allotted or distributed among the 
persons. 


Ex. 1.—The following is a list of the marks (out of 20) which 
were awarded to a class of 50 pupils. The table below shows their 
frequency distribution. Marks: 10, 7, 5, 17, 15, 13, 4, 6, 9, 15, 16, 
20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6, 
20, 15, 18, 17, 13, 11, 4, 16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2. 


Measures   Tallies     F
   20      ||          2
   19      ||          2
   18      |||         3
   17      ||||| ||    7
   16      ||||        4
   15      ||||| |     6
   14      |||         3
   13      ||||        4
   12      ||          2
   11      ||          2
   10      ||          2
    9      |||         3
    8      ||          2
    7      |           1
    6      ||          2
    5      |           1
    4      ||          2
    3                  0
    2      |           1
    1      |           1

                  N = 50

Hints on Drawing Up a Frequency Table.—Do not look through
the original list and pick out all who got the highest measure, e.g. 20, 
then all who got the next highest, 19, and so on. This would be a 
waste of time and would easily lead to mistakes. Instead go through 
the measures in the list consecutively, and as each one is reached, put 
a tally or check mark opposite the appropriate measure in the 
column. When any one measure recurs for the fifth time, draw a line 
through the first four tallies. Repeat this on reaching the tenth, 
fifteenth, etc., tally. Add up the tallies to obtain the F column. Add 
up the F column itself to make sure that it checks with N, the total 
number of persons. 
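
The tally procedure can be sketched in Python as follows. The marks are those listed in Ex. 1; the variable names and the use of `collections.Counter` are assumptions made for illustration only. The list is read through once, each mark receives a tally as it is reached, and the F column is then checked against N.

```python
from collections import Counter

# The 50 marks (out of 20) listed in Ex. 1.
marks = [10, 7, 5, 17, 15, 13, 4, 6, 9, 15, 16, 20, 17, 19, 14, 9, 11, 17, 18,
         8, 1, 9, 12, 19, 16, 15, 17, 14, 13, 10, 6, 20, 15, 18, 17, 13, 11, 4,
         16, 13, 8, 14, 18, 15, 12, 17, 16, 15, 17, 2]

# One pass through the list, adding a tally for each mark as it is reached.
tallies = Counter()
for mark in marks:
    tallies[mark] += 1

# Every possible measure from highest to lowest, including those with F = 0.
distribution = [(measure, tallies[measure]) for measure in range(20, 0, -1)]

for measure, f in distribution:
    print(f"{measure:>2}  {'|' * f:<7}  {f}")

# The F column must check with N, the total number of persons.
n = sum(f for _, f in distribution)
assert n == len(marks)
print("N =", n)
```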

Distributions of Classified Measures.—The above procedure is still 
likely to produce an over-lengthy table, especially when N is large. 
It would be absurd, for example, to tabulate the heights of all the 
pupils of a school in this manner. Instead of listing the frequencies of 
children 40.0, 40.1, 40.2 . . . inches tall, we would give the totals
with heights of 40, 41, 42 . . . inches, or the totals with heights of
40 upwards, 42 upwards, 44 upwards, and so on. That is, we would
group together sets of contiguous measures under a series of classes
or categories. 

Considerable care is needed in deciding upon what is, or is not, 
included in a class. When the variable is discrete—for example, test 
scores ranging from 10 to 93—we may call our classes 10-14, 15-19, 
20-24, etc., or else 10 +, 15 +, 20 +, etc. Both of these indicate 
clearly that all scores of 10, 11, 12, 13, and 14 are included in the 
first class, and that the size of each class is five units. The same 
methods may be used for school marks. For instance, 10%-14%,
15%-19% . . . is unambiguous, although, strictly speaking, the
first class includes all shades of merit from 9½% to 14½%. But with
some continuous variables such as age or height, difficulties often
arise because the original measures are themselves only approximations
which represent classes of measures (cf. p. 3). Thus if height
has been measured to the nearest tenth of an inch, and is tabulated
to the nearest inch, we should avoid calling our classes 40, 41, 42
. . ., etc. For 40 might include 39.95 to 40.95, or 39.45 to 40.45, or
39.55 to 40.55. If we write 40 +, 41 +, etc., the first of these is usually
intended. With ages, 5 +, 6 + . . . is quite clear, since the 5 +
class includes all children who are at, or who have passed, their fifth 
birthday, but have not yet reached their sixth. 

No set rules can be laid down as to the choice of classes, but the
following additional hints will apply to most of the tables which
teachers or psychologists will need.

(1) The number of different classes is generally somewhere between 
6 and 15. Some tables contain more, few less. When the total number 
of persons, N, is only 50 or less, a small number of classes is generally 
sufficient; but if N is nearer 500, a more detailed table with a larger
number of classes may be desirable. 

(2) Pick out from the original list of measures the highest and the 
lowest measure to be tabulated, and deduce from them the total 
range of the distribution. If the range is from 20 to 1, then 7 classes 
containing 3 different measures each are likely to be appropriate. If 
the range is from 137 to 65, then 15 classes of 5 are a possibility. 
In most instances all the classes should contain the same number of 
different measures, i.e. they should be of the same size.¹

(3) It is desirable, though not essential, to include an odd number 
of different measures in each class. The reason for this is that in any 
subsequent calculations we are going to regard all the measures 
grouped within any one class as lying at the midpoint of that class. 
If the classes include, say, 3-7, 8-12, 13-17, 18-22, etc., then we
shall count all the 3’s, 4’s, 5’s, 6’s, and 7’s as 5’s, and all from 8 to 12
as 10’s. The midpoint of a class is found by taking the average of the
highest and lowest measures it can contain; e.g. (3 + 7) ÷ 2 = 5.
The same figure is the midpoint of a class whose true limits are 2½
to 7½. But a class such as 39.95 to 40.95 has the rather inconvenient
midpoint of 40.45.

(4) With school marks it is advisable to arrange for any round 
numbers, such as the 5's and 10's, to fall at the centres of the classes. 
Teachers are very apt to award these round numbers and to neglect 
intermediate ones. Hence, if classes of 10-14, 15-19, 20-24 . . . are 
chosen, most of the frequencies may actually lie at 10, 15, 20 . . .,
and it would be illegitimate to regard them as lying at the midpoints 
of 12, 17, 22 . . . The difficulty will be surmounted if classes of 
8-12, 13-17, 18-22 . . . are used instead. 

(5) Having chosen the classes and written them out in a column, 
tabulate the F's by the ‘tally method,’ as before, and check the
total, N. 
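
Hints (2) to (4) can be illustrated with a short Python sketch. The helper names `class_limits` and `midpoint` are hypothetical and chosen only for this example. It lays out equal-sized classes and shows why classes of 8-12, 13-17, 18-22 centre on the round numbers, whereas classes of 10-14, 15-19, 20-24 do not.

```python
def class_limits(lowest, size, count):
    """Return `count` contiguous classes, each holding `size` different measures."""
    return [(lowest + i * size, lowest + (i + 1) * size - 1) for i in range(count)]


def midpoint(low, high):
    """Average of the lowest and highest measures a class can contain."""
    return (low + high) / 2


# Hint (2): a range of 20 down to 1 fits into 7 classes of 3 measures each.
print(class_limits(0, 3, 7))

# Hint (4): classes 10-14, 15-19, 20-24 have midpoints 12, 17, 22 ...
print([midpoint(lo, hi) for lo, hi in class_limits(10, 5, 3)])   # [12.0, 17.0, 22.0]

# ... whereas classes 8-12, 13-17, 18-22 centre on the round numbers.
print([midpoint(lo, hi) for lo, hi in class_limits(8, 5, 3)])    # [10.0, 15.0, 20.0]
```

An odd class size, as recommended in hint (3), is what makes each midpoint a whole number.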

¹ If the distribution is very far from symmetrical, as in the case of incomes,
school attendances, and the like, a table with unequal-sized classes may better
portray the state of affairs. But no averaging or other statistical calculations can
be carried out with such a table. It may be analysed by the chi-squared technique
(cf. p. 88).

Ex. 2.—Below is the distribution of the measures already listed
in Ex. 1. They have been grouped into classes of 3 marks each.


Classes   Tallies                  F
18-20     ||||| ||                 7
15-17     ||||| ||||| ||||| ||    17
12-14     ||||| ||||               9
 9-11     ||||| ||                 7
 6-8      |||||                    5
 3-5      |||                      3
 0-2      ||                       2

                              N = 50

GRAPHICAL REPRESENTATION OF DISTRIBUTIONS 
A distribution may often be portrayed more clearly by means of 
a graph, either of the type known as a histogram, or by a frequency 
polygon.¹ In both of these the frequencies are plotted on the vertical
(Y) axis against the measures, or classes of measures, on the
horizontal (X) axis. The scale of the graph and the location of the origin (the
point of intersection of the axes) are purely matters of convenience.
Suppose, for example, that our graph paper measures 6 inches × 9
inches, each inch being divided into tenths, and that we wish to plot
the distribution given in Ex. 2. We might use a scale of 3 persons to
the inch on the Y axis, and one class (3 marks) to the inch on the X
axis. This has been done in Figs. 1-3. Fig. 4 portrays two
distributions of intelligence test scores which have been grouped into classes
73-77, 78-82 . . . 133-137. Each inch on the Y axis represents 5
persons, and each inch on the X axis includes 2 classes (10 points)
of score.

FIG. 1.—Histogram of distribution from Ex. 2.

FIG. 2.—Incomplete frequency polygon of distribution from Ex. 2.

¹ The ogive type of graph, or cumulative frequency curve, will be omitted
here. The reader may find accounts of its construction and uses in more advanced
textbooks, such as Garrett (1953) or Dawson (1933).

A completed graph generally looks best if its largest vertical and 
horizontal dimensions are roughly the same. Its appearance will be 
too peaked if the scale of frequencies is too large, or the scale of 
measures too small, and too flat if the converse holds. 

The origin is often placed at X = 0, but there is no need for this.
Thus in Figs. 2-3, it is at X = -2, in order that the Y axis may
be drawn at the left-hand edge of the graph paper. And in Fig. 4 it
is at X = 100, in order that the Y axis may be in the centre of the
page.

The Histogram or Column Diagram.—Fig. 1 is a histogram portraying
the distribution given in Ex. 2. It consists of a series of pillars
or columns drawn side by side. Each pillar represents one measure
or, as in this instance, one class of measures, and its height represents
the frequency of this measure or class. The marks from 0 to 20 are
written along the X axis, but note that each mark is represented by a
space of ⅓ of an inch on this axis, not by one of the inch or
tenth-inch lines. The first class (including 0, 1, and 2 marks) is, therefore,
represented by the whole width of the first inch, the second class (3,
4, and 5 marks) by the whole width of the second inch, and so on.

FIG. 3.—Completed frequency polygon of distribution from Ex. 2.

The Frequency Polygon.—All the measures included in a class are
here represented by a point on the graph, whose Y co-ordinate is the
frequency, and whose X co-ordinate is the midpoint of the class.
Thus, in Fig. 2, which represents the same distribution as before, the
midpoints are written along the X axis, and this time they are located
either at an inch or tenth-inch line. The points on the graph are
connected up by straight lines. Note that it is incorrect to smooth
the lines into a curve, like those in Figs. 10-14. The figure is, as it
were, suspended in mid-air, hence it is usual to complete it (Fig. 3)
by connecting the first and last points with additional points whose
Y co-ordinates are zero. The X co-ordinates of these additional
points are the midpoints of the classes just below the lowest, and just
above the highest classes, in the distribution, i.e. X = -2 and
X = 22.
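
The co-ordinates of such a polygon can be worked out with the following Python sketch, which uses the classes and frequencies of Ex. 2. The variable names are illustrative assumptions, and no plotting library is used; the sketch only lists the points that would be joined by straight lines.

```python
# Classes and frequencies from Ex. 2, lowest class first.
classes = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14), (15, 17), (18, 20)]
frequencies = [2, 3, 5, 7, 9, 17, 7]

# X co-ordinate of each point: the midpoint of its class.
midpoints = [(low + high) / 2 for low, high in classes]

# Number of different measures in each class (all classes are the same size).
size = classes[0][1] - classes[0][0] + 1

# The incomplete polygon: one point per class.
points = list(zip(midpoints, frequencies))

# The completed polygon: add zero-frequency points at the midpoints of the
# classes just below the lowest and just above the highest class.
completed = [(midpoints[0] - size, 0)] + points + [(midpoints[-1] + size, 0)]

for x, y in completed:
    print(x, y)
```

The first and last points printed are (-2.0, 0) and (22.0, 0), which agree with the values of X given in the text.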


Ex. 3.—For further illustration there follow the frequency distri- 
butions of scores obtained by 136 men and 114 women students on 
the Cattell Intelligence Test, Scale IIIA. The scores are grouped into 
classes of 5. The distributions for the two sexes are plotted as 
frequency polygons on the same graph (Fig. 4), and for the sexes 
combined as a histogram (Fig. 5). Note in the latter that the scale 
along the Y axis has been halved, since combining the sexes roughly 
doubles the frequencies to be plotted. 


                    Frequencies

Test Scores   Men   Women   Combined
133 +           1       0          1
128 +           0       1          1
123 +           5       1          6
118 +           8       3         11
113 +          13       9         22
108 +          17       9         26
103 +          21      14         35
 98 +          17      30         47
 93 +          23      20         43
 88 +          14      13         27
 83 +           6      12         18
 78 +           8       2         10
 73 +           3       0          3

N =           136     114        250
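
A table of this kind is easily checked by machine. The Python sketch below is an illustration only. Several Men and Women entries are damaged in the source, so the values used here are a reading chosen to agree with the Combined column and with the stated totals of 136, 114, and 250; that reading is an assumption. The sketch confirms that each Combined frequency is the sum of the two sexes and that the columns add up to N.

```python
# Lower limits of the classes (each class covers 5 points of score).
lower_limits = [133, 128, 123, 118, 113, 108, 103, 98, 93, 88, 83, 78, 73]

# Frequencies for each sex; damaged entries are inferred from the
# Combined column and the stated totals (an assumption).
men = [1, 0, 5, 8, 13, 17, 21, 17, 23, 14, 6, 8, 3]
women = [0, 1, 1, 3, 9, 9, 14, 30, 20, 13, 12, 2, 0]

# The Combined column as printed.
combined_printed = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]

combined = [m + w for m, w in zip(men, women)]

for limit, m, w, c in zip(lower_limits, men, women, combined):
    print(f"{limit:>3} +  {m:>3}  {w:>3}  {c:>3}")

print("N =", sum(men), sum(women), sum(combined))
assert combined == combined_printed
```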


Relative Merits of Frequency Polygons and Histograms.—Fre- 
quency polygons possess the advantage over histograms that two or 
more distributions can be superimposed and compared on one graph, 
as in Fig. 4. But they are inferior in that the lines connecting suc- 
cessive points give a somewhat inaccurate notion of the distribution, 
especially when the number of points is small. For instance, in Fig. 4
the connecting line between the two points X = 110, F = 17 and
X = 115, F = 13, suggests that 15 men students obtained intermediate
scores of 112½, which is quite untrue. In this respect the
histogram is unexceptionable, since it gives an exact portrayal of the
tabulated data. The total area enclosed within a histogram corresponds
accurately to the total number of persons, N. The area within
a polygon corresponds only approximately. Histograms have a further
advantage, namely, that they can be employed when the variable is
not a truly quantitative one. For example, examination scripts are
often graded, not with numerical, but with letter, marks—A, A -,
B +, etc. The frequency distribution of the numbers of examinees
who are awarded each letter can be shown by a histogram.

FIG. 4.—Frequency polygons of distributions from Ex. 3.
(Solid line: men. Broken line: women.)


CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS 
Smoothness of Distributions.—Here we must distinguish provi- 
sionally between ‘objective’ and ‘subjective’ measures, and discuss 
the difference more thoroughly in the next chapter. Objective 
measures are those which are practically independent of the person 
who makes them, such as heights measured with a ruler, or the 
scores on a standardized educational or intelligence test. Subjective
measures, by contrast, depend upon personal evaluations. They
include, for example, the marks awarded by a teacher to a series of
English compositions, assessments of pupils’ characters, and the like.

FIG. 5.—Histogram of distribution of men's and women's scores
combined, from Ex. 3.

FIG. 6.—Distribution of test scores of a small group.

Now when graphs are drawn of frequency distributions of objec- 
tive measures, certain characteristic features are very commonly 
found. First, the graphs are generally very irregular when the fre- 
quencies are small, but the irregularities tend to become ‘ironed out’ 
as the numbers increase. And when N reaches a value of several
hundreds, the frequency polygon or histogram approximates to a 
smooth curve. Thus the distribution in Fig. 5 is much more regular 
in its rise and fall than is the distribution in Fig. 6, which represents 
the scores of a group of 50 students, selected at random from the 
previous 250. 

Fig. 7, which is a frequency polygon of the heights of 6,194 male 
adults, gives an almost smooth curve. 

FIG. 7.—Frequency polygon of heights of 6,194 English male adults.

FIG. 8.—Irregular distribution of test scores grouped in small classes.

Frequencies may be increased not only by increasing the value of
N, but also by grouping the measures into larger classes. Figs. 8-9 
show frequency polygons for the same set of 250 measures grouped 
into classes of 3 and 10 respectively. The latter is distinctly smoother 
than the former. 

Types of Distribution Curves.—In actual practice we can never 
obtain measures of a large enough number of persons for the distri- 
bution graphs to become completely smooth curves. Nevertheless, 
it is customary to represent various types of distributions pictorially 
by means of curves of appropriate shapes, to which obtained
distributions approximate more or less closely. For example, we can say
that Fig. 10 represents roughly the type of distribution which might
be expected for the incomes of parents of elementary school children.
The greatest number would probably lie between £300 and £800 per
annum, and the frequencies of those at higher income levels would
progressively diminish. This particular type of distribution is called
a skew curve, because of its asymmetry. It is positively skewed
towards the upper end, and shows a piling up of cases at the lower
end. A bunching at the upper end would give a negatively skewed
curve.

FIG. 10.—Skew curve, illustrating approximate distribution of annual
income among parents of elementary school children. (Horizontal axis:
income, marked at £500, £1000, £1500, and £2000.)

Fig. 11 is a bimodal curve such as might be obtained if we graphed
the intelligence quotients¹ of pupils in two schools, one containing
very intelligent children of professional class parents, the other
containing inferior children of very poor stock. Most of the I.Q.s fall
either between 110 and 140 or between 70 and 90, but there are a
few of even higher or even lower intelligence, and a few (probably
drawn from both schools) of intermediate or average intelligence,
with I.Q.s of 90 to 110.

Fig. 11.—Bimodal curve, illustrating intelligence quotients of mixed
groups.

The two polygons which appear in Fig. 4 are irregular on account
of the small frequencies involved. They approximate, however, to a
pair of curves such as those shown in Fig. 12. Note here that the
more highly peaked curve (representing women’s scores) does not
indicate that women obtain, on the whole, higher scores than men,
but that women obtain more scores close to the average of 100
points, and fewer extreme scores; in other words, that men are more
dispersed or spread out in their ability.

¹ It is assumed that all readers know roughly what is meant by
intelligence quotients (I.Q.s). A definition may be found on p. 19, and
a more precise account of their meaning on p. 194.

It should be remembered that the areas enclosed beneath distribution
curves are proportional to the total numbers of cases involved.
In Fig. 12 the areas are roughly equal, since the total numbers of the
two sexes are about the same. Fig. 13 shows the curves to which the
distributions of scores among Honours and among ordinary degree
students approximate. The latter have, on the whole, somewhat lower
scores than, but are four times as numerous as, the former. It is more
usual, when groups of such different sizes are being compared, to plot
percentage frequencies instead of actual frequencies, since the curves
will then possess the same area, and differences in the averages and
ranges or dispersions of measures will stand out better.

Fig. 12.—Two distributions of intelligence test scores, with different
dispersions.

Fig. 13.—Distributions of intelligence test scores of two groups, one
much larger than the other but of lower average intelligence.
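
The conversion to percentage frequencies may be sketched as follows. The two sets of class frequencies are invented, chosen only so that the second group is four times as numerous as the first; the function name `percentage_frequencies` is likewise an assumption for the purpose of illustration.

```python
def percentage_frequencies(freqs):
    """Convert raw class frequencies to percentages of the group total."""
    total = sum(freqs)
    return [100.0 * f / total for f in freqs]

# Invented frequencies for five score classes, lowest class first.
honours = [2, 8, 20, 14, 6]        # N = 50
ordinary = [20, 60, 80, 32, 8]     # N = 200

honours_pct = percentage_frequencies(honours)
ordinary_pct = percentage_frequencies(ordinary)
```

Each list of percentages now totals 100, so the two curves enclose the same area, and the heavier weighting of the larger group towards the lower classes can be read off directly.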


THE NORMAL FREQUENCY DISTRIBUTION

Whenever a large group of human beings is taken at random (i.e. not
specially selected like the two schools whose I.Q.s are
graphed in Fig. 11), then objective measures of almost any trait or 
ability they possess—physical or mental—will be found to conform to 
a certain type of distribution known as the normal curve. This is often 
referred to as the normal probability curve, the normal curve of 
error, or the Gaussian curve. It is symmetrical, bell- or cocked-hat- 
shaped. It indicates that the majority of persons in the group obtain 
measures round about the average, relatively few obtain extreme 
measures, and that the frequencies above the average are the same 
as the frequencies below it. The four curves in Figs. 12-13 belong to 
this type, and all the graphs from Fig. 4 to Fig. 9 roughly approxi- 
mate to it, since they represent distributions of scores obtained by 
unselected groups of persons. Innumerable instances of distributions 
which tend to this normal shape could be cited: the heights of chil- 
dren in an ordinary school class, the numbers of times they can tap 
with one finger in a minute, or the numbers of words they can write. 
Variations in individual performances as well as differences between 
individuals show the same trend: if a person tries a hundred times 
to copy a line of a certain length, most of his attempts will be close 
to his own average, and a few will be considerably too long or too 
short. Differences between groups are also similar: if all the
elementary school classes aged 11+ in a large city are given an
objective arithmetic test, and the average mark of each is calculated, then
a graph of all these averages may be expected to show a normal 
distribution. 

Conditions which Upset the Tendency to Normality.—There are 
several factors which may disturb this normal tendency. First, the 
total frequencies involved may be so small that the irregularities, 
noted above, may mask it. It is usually apparent, however, even with 
groups of 50 or less (cf. Fig. 6). Secondly, the group of persons may 
be selected in such a way as to distort an otherwise normally distri- 
buted variable. For example, a school class which has been ‘creamed’ 
will be likely to show a negatively skewed distribution of test scores 
or examination marks, since there will be a deficiency of pupils of 
high ability. The police force or the army will not show a normal 
distribution of heights, if no recruits are accepted below a certain 
height.

Thirdly, there are a few human characteristics which, although
objectively measurable, are known to be abnormally distributed.
Accident proneness is one such variable. It is found that a few people
are liable to undergo a large number of accidents in any given period
of time, whether at home, or in factories, or when driving cars. But
the majority have few or no accidents.¹ Thus the distribution takes
roughly the form shown in Fig. 14. Annual income also gives a skew
distribution, and it is likely that other measures of ‘productivity’
such as literary or scientific output are similar (cf. Burt, 1943). It
is hardly possible to prove that intelligence quotients conform to the
normal curve, though this is a convenient assumption that accords
reasonably with commonsense observation. It does break down at the
bottom end of the scale, since there are more imbeciles and idiots of
very low I.Q. in the population than would be expected. Some
investigators have claimed other marked divergences from the normal
shape, but these might be attributable to inequalities in the units of
measurement.

Fig. 14.—Approximate frequency distribution of accident proneness.

Such irregularities, or defects in our methods of measuring variables,
constitute a fourth condition, which is of very frequent occurrence,
as we shall see in the next chapter.

¹ Cf. H. M. Vernon (1936).


CHAPTER II 


PRINCIPLES OF MENTAL MEASUREMENT AND 
MARKING 


WE are all accustomed nowadays to regarding a scholastic capacity 
(e.g. arithmetical ability), like height or weight, as ranging from a 
very small amount in some people to a very large amount in others, 
and to assessing these amounts in numerical units such as percentage 
marks. General intelligence and other aptitudes can similarly be 
graded as quantitative variables, average intelligence being generally 
denoted by an Intelligence Quotient or I.Q. of 100, superior or
inferior intelligence by larger or smaller numbers. Yet the conception 
of scaling mental qualities in a manner similar to physical ones is of 
fairly recent origin—we owe it largely to the work of Sir Francis 
Galton—and it is still somewhat limited in application and beset 
with many difficulties. A human ability or trait is far more complex 
than a length, a weight, or a time; it can never be completely repre- 
sented by a single numeral on a scale. Consider, as an example, 
political opinions: a communist is certainly more socialistically
inclined than a liberal, and a fascist is more reactionary than a
conservative. Successful tests have, therefore, been devised which scale
these attitudes as a single variable ranging from extreme left-wing to 
extreme right-wing. But such a scale can hardly do justice to all the 
various shades of political opinion; it inevitably involves some over- 
simplification and artificiality. Particularly in the field of tempera- 
ment and character does such distortion arise, for here people appear 
to us to vary from one another, not merely in amount, but also in 
kind or quality.¹ The same is true, though to a lesser extent, in the
field of abilities. Thus the precise nature of intelligence is an extremely
controversial matter, and the common tendency to regard the I.Q.
as an accurate measure of some definite and stable mental entity is,
as we shall see below, an egregious misinterpretation.

¹ For a fuller discussion of these points, cf. Vernon (1953).

No mental trait or ability can be measured directly by laying some
kind of ruler alongside it. Instead we have to grade it by observing
its expression in words or actions. Arithmetical ability, for example,
is shown by success at a series of sums; moral character by certain
modes of behaviour. Indeed, for scientific purposes, we have to regard 
a trait or ability, not as some power or faculty in the mind, but as a 
particular class or category of related performances, e.g. of the 
arithmetical or the moral kind. And special statistical techniques are 
required for deciding just what performances should be included 
under one heading, as representative of one ability or trait (cf. Chap. 
VIII). Since we cannot, of course, observe and record all the per- 
formances which go to make up a mental characteristic, we are 
forced to take samples of them. Our procedure, as Binet pointed out, 
is analogous to that of the mining engineer who cannot examine all 
the ground in a given locality, but instead takes borings at various 
points, and from these samples deduces with reasonable accuracy 
what the district as a whole is like. Similarly, the psychologist 
measures intelligence, or the teacher measures achievement in arith- 
metic, on the basis of the child's success at sample intellectual and 
arithmetical tasks. An arithmetic examination sets, perhaps, six 
questions, the answers to which constitute a sampling of the general 
field of arithmetical knowledge which the pupil is supposed to have 
acquired. 

How, then, is success at such sample performances expressed in 
numerical terms? The various methods of marking or scoring mental 
tests and examinations may be classified as follows: 

A. Enumeration or count scoring.

B. Ranking. 

C. Qualitative grading and classifying. 

D. Numerical evaluation. 

E. Mixed numerical evaluation and count scoring. 


A. ENUMERATION OR COUNT SCORING 

This type of marking is the only objective type, in the sense that 
each person's mark is quite independent of the marker's subjective 
evaluation of his work. When, for example, a series of simple arith- 
metic sums is set, the answers to which are all either right or wrong, 
or when one mark is deducted for each mistake in a dictation test,
then a pupil's total mark is a count or enumeration score. Since this
method is applicable only when no doubt can exist as to the
correctness or incorrectness of the answers which are counted up, it is
largely restricted to very simple kinds of scholastic work. We shall 
see later, however (Chap. XII), that attainments in more advanced 
and complex school or university work can be fairly adequately
measured by sets of brief, objectively marked questions. The great 
majority of standardized educational and general intelligence tests 
follow this method, so as to eliminate the personal element in scoring. 
Steel and Talman (1936) claim that even English compositions can 
be marked by counting all the good and bad points of vocabulary, 
sentence-structure, and sentence-linking which they contain. The
scoring is objective also among what are known as performance tests
of intelligence or of special aptitudes. The child is set some practical
or manipulative task, and a record is obtained, either of the number
of pieces successfully put together or of the number of seconds taken. 
This type of marking includes two distinct methods of measure- 
ment which may be termed the rate method and the power method. 
A test or examination may contain a number of fairly simple tasks,
all of approximately uniform level of difficulty. The score is then the
number of these tasks accomplished within a certain time limit.
Alternatively, it may contain tasks which are graded in difficulty
from quite simple to very hard tasks, such that all persons being
tested can answer some of them, and scarcely any or none can
manage all of them. Here no time limit is necessary. The score is
again based on enumerating the tasks successfully completed, since
those whose ability, knowledge, or ‘power’ is greater are able to
answer tasks of a higher level of difficulty. These two methods have
aptly been compared to a hurdle race and a high-jump competition
respectively. The table of tests given in Chap. IX lists several
examples of both. Thus, Ballard’s One-minute Reading and Arithmetic
Tests and Burt’s Reading Fluency Test belong to the former
type. Burt’s Graded Vocabulary Tests of Reading and Spelling and 
the Terman-Merrill Intelligence Test belong to the latter. Most tests 
and examinations, however, combine the two, i.e. they include easy 
and difficult items, but also impose a time limit, so that the score 
depends on power and rate of working. 
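
The distinction between the two methods may be put in the form of a short Python sketch. The records are invented, and the function names `rate_score` and `power_score` are hypothetical labels for the two counts; no actual test is being reproduced.

```python
def rate_score(seconds_per_item, time_limit):
    """Rate method: number of uniformly easy items finished within the limit."""
    elapsed = 0.0
    done = 0
    for seconds in seconds_per_item:
        if elapsed + seconds > time_limit:
            break               # the next item would overrun the limit
        elapsed += seconds
        done += 1
    return done

def power_score(passed):
    """Power method: number of graded items passed, with no time limit."""
    return sum(1 for p in passed if p)

# Invented records for one pupil.
rate = rate_score([10, 12, 9, 15, 11, 14], time_limit=60)
power = power_score([True, True, True, False, True, False, False])
```

In both cases the mark is a simple count; the methods differ only in what limits the count, the clock in the first case and the rising difficulty of the items in the second.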

Comparison of Mental Count Scores with Physical Measures.—We 
must ask now whether such count scores can legitimately be regarded 
as measures of mental abilities, analogous, say, to measures of height 
and weight or of physical abilities such as rate of running, power of 
muscular contraction, and the like. Actually the numbers in terms 
of which abilities are expressed show several important differences 
from numbers such as inches on a ruler or pounds on a weighing 
machine. The latter, physical, units are all equal to one another at 
any point on the scale. E.g. the difference between 50 and 49 inches
is precisely the same as the difference between 40 and 39 inches. 
Moreover, they always have a zero point: e.g. 0 lb. means no weight 
at all. Mental ability units often lack these features, and this fact 
precludes us from adding, averaging, or otherwise treating them as 
if they were true numbers. 

For example, a one-year-old child might fail to pass any of the 
Terman-Merrill Intelligence Tests, but his zero score shows, not that 
he possesses zero intelligence, but that the tests are an inadequate 
measuring instrument for this purpose. Again, a child with an 
I.Q. of 120 cannot be considered twice as intelligent as one with
an I.Q. of 60, because we do not know what is the zero point of
the I.Q. scale. Even in a rate test of, say, reading speed, a child who 
reads 50 simple words in a minute, and so scores twice as much as 
one who reads 25 words in the same time, cannot properly be said 
to be twice as good at reading. 

It is obvious that the units in a rate test, such as words read, can 
never be precisely uniform in difficulty. Similarly, the various errors 
in a dictation test may vary considerably in seriousness. It should be 
noted also that two children who obtain the same score on a rate 
test do not necessarily accomplish precisely the same items, since one 
may fail on, or omit, certain items which the other passes; and they 
may each make the same number of mistakes in the spelling of a 
certain passage, many of which are different. Our tests, then, might 
be compared to a weighing machine with a large number of pound 
weights, some of which are slightly heavier, some slightly lighter, 
than a pound. When using the machine to weigh children we do not 
always pick out precisely the same set of weights. Now, in spite of 
these defects in the weights, we can claim that two children, both 
found to be 70 lb., are very nearly the same weight, provided that
the irregularities among the pounds are small. And if four children
are found to weigh 90, 80, 70, and 60 lb., we can claim that the
difference between the first two is likely to be nearly equivalent to 
the difference between the last two. Thus this analogy indicates that 
when the number of items in a mental test is large, and irregularities 
are small, the mental units may be regarded as approximately 
equivalent. 

In tests of the pure power, or combined rate and power, types, the 
items are not intended to be equal in difficulty. Instead, each item 
should exceed the previous one in difficulty by approximately the 
same amount. For example, each test in the Terman-Merrill Scale
(from Year VI to Average Adult level) is supposed to represent an 
increment of two months of mental age.¹ But it is generally admitted
nowadays that mental age units are not equivalent, and that the
difference between a child of M.A. 13½ and one of 13 is probably
much smaller than the difference between children of 6½ and 6. Here,
then, it is as if we ran out of (nominal) pound weights after a certain 
point, and had to use further weights which progressively increased 
in heaviness. Obviously this leads to grave difficulties in the mental 
age method of scoring (cf. Chap. IV). Several group intelligence 
tests, however, do manage to maintain approximately equal incre- 
ments throughout. 

The units of measurement of many physical skills present similar 
problems. That equal increments of score do not necessarily represent 
equal increments of skill can be seen from the example of golf scores. 
It is decidedly easier for a player who requires 140 strokes for a round 
to improve to 130, than it is for a player who takes 70 to improve 
to 60 (or even—if we could assume golf scores to follow a geomet- 
rical rather than an arithmetical progression—to 65). Here, also, it is 
impossible to establish a zero point of ability, or to calculate ratios 
between abilities. 
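
The arithmetic behind the bracketed figure of 65 may be set out briefly. On an arithmetical progression equal improvements are equal differences; on a geometrical progression they are equal ratios. The function name `equivalent_target` is hypothetical.

```python
def equivalent_target(start, reference_start, reference_end):
    """Score bearing the same ratio to `start` as reference_end bears to reference_start."""
    return start * reference_end / reference_start

# Improvement from 140 to 130 strokes, carried over to a player on 70.
arithmetic_target = 70 - (140 - 130)                 # equal differences
geometric_target = equivalent_target(70, 140, 130)   # equal ratios
```

The first gives 60 strokes and the second 65, which are the two figures quoted above.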

Special techniques have been devised by Thorndike, Thurstone, 
and others for the production of measuring scales with equal units, 
and for determining the zero points on such scales. They are too 
complex for description here. But they depend in the main upon the 
principle of normal distribution of objective measures, which was 
enunciated in Chap. I. Well-constructed tests always tend to yield 
a normal frequency distribution; hence we are entitled to regard 
departures from normality as indicative of bad construction or 
bad scoring. For example, a mental test which contains plenty 
of easy and moderately difficult items, but whose later items in- 
crease too rapidly in difficulty to allow sufficient headroom, will 
give a negatively skewed distribution. The analogy with a weighing 
machine would be as follows: suppose that the available weights 
beyond the seventieth became progressively bigger than 1 lb., and
that a large group of children were measured whose true weights
averaged about 70 lb., and ranged from 50 to 90 lb. The distribution
would then be distorted from normal to negative skew type, as in 
Fig. 15. Conversely, a test which contains insufficient easy items, so
that it only differentiates effectively among the better pupils, is
likely to show a positively skewed distribution.

¹ For a definition and discussion of mental age, cf. pp. 69-73.
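
The weighing-machine analogy can be tried out numerically. The sketch below is an illustration under stated assumptions, not a reconstruction of Fig. 15: the true weights are taken as evenly spaced normal quantiles with an assumed standard deviation of 6 lb., and each weight beyond the seventieth is assumed to be 0.05 lb. heavier than the one before it. The names `reading` and `skewness` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def skewness(values):
    """Third standardized moment: negative when cases bunch at the upper end."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

def reading(true_weight, standard=70, growth=0.05):
    """Number of nominal 'pound' weights needed to balance true_weight.

    The first `standard` weights are exactly 1 lb.; each later one is
    assumed to be `growth` lb. heavier than the one before it.
    """
    total, count = 0.0, 0
    while total < true_weight:
        count += 1
        extra = max(0, count - standard)
        total += 1.0 + growth * extra
    return count

# A symmetrical set of true weights averaging 70 lb. and running from
# about 50 to about 90 lb.
dist = NormalDist(mu=70, sigma=6)
true_weights = [dist.inv_cdf((i + 0.5) / 1000) for i in range(1000)]
readings = [reading(w) for w in true_weights]

true_skew = skewness(true_weights)
reading_skew = skewness(readings)
```

The true weights are symmetrical, so their skewness is practically zero; the readings, whose upper range has been compressed by the oversized weights, come out with a negative skewness.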

Fig. 15.—Normal distribution (a), altered to negative skew curve (b), by
increasing the size of the higher units.

Application to School Marks and Examinations.—It is difficult to apply
this principle to school marks, since the groups are usually so small
(less than 50) that great irregularities in distribution are sure to
arise. The following table, however, gives the marks obtained by two
11+ classes in an ordinary school on an objective arithmetic test.
Owing to the deficiency in difficult questions, many pupils at the top
got full marks (80), and the real extent of their ability was not
tested at all.


Ex. 4.—An abnormal frequency distribution of arithmetic marks. 


Mark     F
80       [?]
70+      6
60+      6
50+      14
40+      17
30+      8
20+      4
10+      8
0+       [?]

N = 78

Further problems occur in the interpretation of count scores. 
Physical measures, such as 60 inches or 15 seconds, always represent 
definite amounts. No doubt exists as to their largeness or smallness, 
because an inch or a second has an absolute, standard size. Laymen 
and teachers often wrongly suppose that the same is true of mental 
measures, such as the mark awarded to a pupil’s work. They talk of 
15/20 or 60% as if these numbers implied the same level of achieve- 
ment whenever, or by whomsoever, they are awarded. Yet it is surely 
obvious that such scores, by themselves, tell us nothing at all about the 
goodness or poorness of achievement. Such statements as: “Johnny 
is doing well at arithmetic. He got 8 sums out of 10 correct” or “His
dictation is terrible; he made 15 mistakes” are quite
meaningless unless we also know the difficulty of the sums and of the 
passage dictated. Johnny may be the poorest arithmetician in his 
class, and yet score 8/10 if sufficiently simple sums are set. Or he 
may be the best speller in his class, and yet make 15 mistakes if the 
passage happens to be taken, say, from a textbook of organic 
chemistry. The speaker is, of course, implying that the tasks are of a 
level of difficulty such that most of the class score less than 8
sums correct, or make fewer than 15 mistakes. In other words, the marks
possess only a relative, not an absolute, significance. Again, when an
inspector or head teacher complains that the average examination mark
of a class is ‘too low,’ he is
presumably comparing the mark and the difficulty of the examina- 
tion with subjective recollections of the achievements of other 
similar classes. Undoubtedly there is a great deal of loose thinking 
about these matters, and much can be learnt from the more sys- 
tematic procedure of the scientific mental tester. 

The tester realizes that the goodness or poorness of a child's test 
performance can be interpreted only by comparing it with the per- 
formances of other similar children. Hence he applies his test to 
large numbers of children, similar as to age and other relevant 
circumstances, scores their performances all in the same way, and 
can then state definitely whether a certain performance is superior or 
inferior, and how much better or poorer it is than the average. This 
procedure is called the establishment of norms or standards for the 
test. A discussion of the main types of test norms is given in Chap. 
IV, where it is also shown how the adoption of analogous methods 
may greatly help in the interpretation of school and examination
marks.
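
The following Python sketch shows how one raw score acquires its meaning only from a norm group. The two norm groups are invented, and the percentile rank and standard score used here are offered merely as two familiar kinds of norm; they are assumptions of this illustration and not necessarily the types treated in Chap. IV.

```python
from statistics import mean, pstdev

def percentile_rank(score, norm_scores):
    """Percentage of the norm group below `score`, counting ties as half."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

def standard_score(score, norm_scores):
    """Distance of `score` from the norm-group average, in standard deviations."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)

# Invented norm groups: marks out of 10 on an easy and on a hard paper.
easy_paper = [7, 8, 8, 9, 9, 9, 10, 10, 10, 10]
hard_paper = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8]

rank_on_easy = percentile_rank(8, easy_paper)
rank_on_hard = percentile_rank(8, hard_paper)
```

A mark of 8 out of 10 stands near the bottom of the first group and at the top of the second, which is the point made above about Johnny's sums.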


B. RANKING 

Ranking means listing a group of pupils or students in order of 
their attainment, 1st, 2nd, etc., down to bottom. This method may
be applied to any form of scholastic work, including composition, 
handwriting, and the like, which are scarcely amenable to count 
scoring. It is actually the soundest of all forms of marking, in spite of 
its many limitations, because these limitations are so obvious and so 
easily recognized. The first drawback is that the number of persons 
who can be ranked is rather small. Even the arrangement of twenty 
in order involves a good deal of uncertainty, especially near the 
middle of the list. Those at the extremes are usually more readily 
discriminated from one another. Next, it is clear that these ranks do
not provide a true numerical scale, since the numbers 1, 2, 3, etc., are 
by no means equally spaced. Usually the differences between the 
attainments of Nos. 1 and 2, 2 and 3, 18 and 19, 19 and 20 (out of 20) 
are greater than the differences between Nos. 9 and 10, 10 and 11— 
hence the difficulty, just mentioned, in ranking the medium pupils. 


Fig. 16.—Rectangular frequency distribution given by a rank order.


The frequency distribution is rectangular (Fig. 16), and is entirely 
unlike the normal curve. These numbers, therefore, may give quite 
false impressions if they are added, averaged, or otherwise manipu- 
lated. For example, a pupil who is first in one examination and third 
in another should be regarded as better than, not as equal to, a 
pupil who is second in both (cf. p. 59).
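
One way of seeing why the two pupils are not equal is to replace each rank by the normal deviate of its position in the group. This is a sketch under assumptions: the underlying attainment is taken to be normally distributed, each pupil is placed at the middle of his rank interval, and the group is taken to number 20. The conversion is a common convention and is not necessarily the method described on p. 59; the name `rank_to_deviate` is hypothetical.

```python
from statistics import NormalDist

def rank_to_deviate(rank, group_size):
    """Normal deviate for a rank, the pupil being placed at the middle of his interval."""
    proportion_below = 1.0 - (rank - 0.5) / group_size
    return NormalDist().inv_cdf(proportion_below)

n = 20
first_and_third = rank_to_deviate(1, n) + rank_to_deviate(3, n)
second_twice = 2 * rank_to_deviate(2, n)
```

The averaged ranks are identical (2 in each case), but the step from 1st to 2nd is larger than the step from 2nd to 3rd, so the pupil placed 1st and 3rd receives the higher total.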

A third main drawback is that these numbers are always relative 
to the particular set of persons who have been ranked. Thirtieth 
place is not an unequivocal number like 30° on a thermometer. For 
30th out of 32 is much poorer than 30th out of 50. Moreover, if any 
one person higher on the list should happen to be absent, the 30th 
rank would change to 29th. It is not even analogous to, say, 30 mis- 
takes in a dictation test, since such a test can be applied to large 
numbers of children, and norms can be collected which will tell us 
how low is the level of ability to which 30 mistakes corresponds. In
contrast, a rank position can never be so interpreted with reference to
outside standards.
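
The dependence of a rank on the size of the group is a matter of simple arithmetic, sketched here in Python. The function name `percent_outranked` is hypothetical.

```python
def percent_outranked(rank, group_size):
    """Percentage of the group placed below the given rank."""
    return 100.0 * (group_size - rank) / group_size

small_group = percent_outranked(30, 32)
large_group = percent_outranked(30, 50)
```

The pupil who is 30th out of 32 stands above only 6.25 per cent. of his group, while the pupil who is 30th out of 50 stands above 40 per cent.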


C. QUALITATIVE GRADING AND CLASSIFYING 
Examination scripts, essays, etc., are often merely classified into 
some three to ten grades which are distinguished by letters (A, A-,
B+, etc.; α, β, etc.) or by names (Credit, Pass, Fail; 1st class, 2nd
class, etc.; Excellent, Very Good, Good, etc.). This is a more
primitive type of grading than numerical evaluation (e.g. on a 0-10,
or a percentage, scale), though it is of course often applied to scripts
which have already been awarded percentage marks. For instance, 
examinees whose mark is 80% or more are informed that they have 
obtained First Class or Credit, and so on. 

Since these grades make no pretence of being numerical measures, 
we need not cavil at the eccentricity of their frequency distributions. 
They are, however, liable to another, still more serious, defect which 
is not found among count scores or ranks, namely that different 
teachers or examiners often employ widely different scales or stan- 
dards of grading. Although they are supposed to represent absolute 
evaluations of achievement, they are actually found by experimental 
investigation to involve such gross discrepancies that neither pupils, 
parents, nor teachers can hope to interpret their significance correctly. 
For example, Hartog and Rhodes (1935) arranged for 48 School Cer- 
tificate French scripts to be graded independently by six experienced 
examiners. One of the examiners awarded 46 credits, 2 passes, and 0 
failures. Another gave 17 credits, 12 passes, and 19 failures. The 
remaining four were intermediate. Similarly, Boyd (1924) had 26 
essays by 11+ children graded by 271 teachers as either Excellent,
VG+, VG, G+, G, Moderate, or Unsatisfactory. Some of the
teachers awarded one or other of the two highest grades to as many 
as 15 of the essays, others to none of them. Some also awarded one or 
other of the two lowest grades to 17 of the essays, others to none of 
them. In view of such results, it is clear that there is no agreed
meaning as to the absolute amount of ability which these gradings 
should signify. A mark of Excellent may represent a high level of 
ability if it is awarded by a teacher who very seldom gives Excellents. 
It may mean little better than average ability if it is awarded by a 
teacher who habitually grades nearly half his papers as Excellent. 
In other words, this type of marking is actually a relative one, like 
ranking. A mark of G+, or B-, or Pass, etc., only achieves any
meaning if we know what proportions of candidates obtained better
or poorer marks, in just the same way as a rank position of 30th can
only be interpreted if we know how many candidates were ranked 
lower than 30th. 
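
The proportions in question are easily worked out for the two most divergent examiners in the Hartog and Rhodes investigation quoted above. The figures are those given in the text; the function name `grade_shares` is hypothetical.

```python
def grade_shares(counts):
    """Percentage of scripts placed in each grade by one examiner."""
    total = sum(counts.values())
    return {grade: round(100.0 * n / total, 1) for grade, n in counts.items()}

# Gradings of the same 48 School Certificate French scripts.
lenient = grade_shares({"credit": 46, "pass": 2, "fail": 0})
severe = grade_shares({"credit": 17, "pass": 12, "fail": 19})
```

From the first examiner a Credit places a script merely among the top 96 per cent. or so; from the second it places it among the top 35 per cent.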

Qualitative grading should then be regarded as a form of ranking 
where, instead of there being as many different ranks as there are pupils 
or examinees, there are only some 3 to 10 ranks, any one of which 
may be awarded to several pupils whose work is of approximately 
equal merit. Its grave defect is that any one script is not usually
compared directly with the scripts of a particular group of pupils be- 
fore receiving its grade; it is merely compared with the marker's more 
or less vague recollections of the scripts which he has marked in the 
past. The only possible rational meaning for Excellent or First Class 
is that the script which receives this grade seems to him to rank 
higher, or to be relatively better, than almost all the scripts produced 
by similar pupils which he has come across. The other grades are 
similarly referred to rough subjective standards which he has set up 
during his previous marking experience. And since his experience 
naturally differs from all other markers’ experiences, so will his 
standards differ, unless he has taken the trouble to discuss and com- 
pare them with others, and to try to adjust them accordingly. 
Finally, as in the case of rank orders, there is no possibility of 
exact comparison of such grades with external standards or norms, 
and they cannot be added or averaged in any satisfactory way. 


D. NUMERICAL EVALUATION 
and 
E. MIXED NUMERICAL EVALUATION AND COUNT SCORING 
Numerical evaluation consists, nominally, in assigning absolute 
numerical grades to pupils’ abilities, school work, or examination 
scripts, either on a 0-10, 0-20, percentage, or some other scale. It
exhibits all the ambiguities of the qualitative grading just described, 
and possesses several additional defects of its own. In other words, 
although it is the most commonly used, it is quite the worst type of 
marking. Unfortunately, since it involves numbers, it is frequently 
confused with the count scores obtained from objective marking 
(Type A). Teachers assume that the label of, say, 15/20 which they 
award to an English composition is equivalent to a score of 15 cor- 
rect arithmetic sums out of 20. Often the two become inextricably 
mixed. For instance, partial credits involving subjective judgments 
of merit may be awarded to partially right sums. Or an examination 
in history, science, languages, etc., may include, say, five questions 
each of which is marked subjectively out of 20, the marks then being 
added as if they were count scores to yield percentage totals. Or some
analytic plan of marking may be adopted whereby one mark is
counted for each correct fact or idea, and so many are added for
subjective impressions of style or general merit.

¹ When the group of pupils is a large one, their grades can be converted
into rational scores; cf. Dawson (1933), pp. 91-3.

Distributions of such marks are sometimes smooth and symmetrical,
but are often wildly irregular or skewed. In an examination where a
pass mark of, say, 50% has been fixed, it is common to find very
large numbers of 50's, 51's, and 52's, but no 45's to 49's. Some
teachers show peculiar idiosyncrasies, such as avoiding certain marks 
altogether. For instance, they may award 9, 8, 6, 5, or 3 out of 10 
very often, but scarcely ever give 10, 7, 4, 2, or 1. Two illustrations 
may be given of distributions among students or pupils who also 
took an examination which was objectively scored. 

Ex. 5.—Distributions of (Subjective) Marks for Reading and 
(Objective) Marks for Geography. 

Mark     Reading F     Geography F 

 10         24              2 
  9         25              3 
  8         13              8 
  7          6             10 
  6          4             11 
  5          2              9 
  4          0              9 
  3          0              7 
  2          0             10 
  1          0              4 
  0          0              1 

Total       74             74 


The reading marks in Ex. 5 are fairly typical of the distribution 
adopted by teachers in evaluating this subject, or composition, or 
handwriting. Note that this teacher was not, as she supposed, mark- 
ing out of 10, but out of 6, since she only used 6 different grades. 
The fact that about one-third of the class get 10 implies that the 
work of all these children was equally good, which is certainly not 
true. The geography marks awarded by the same teacher were 
derived from a new-type test paper, and so give a roughly normal 
distribution. It is surely obvious that 5/10 for geography does not 
have the same significance, nor represent the same achievement, as 
5/10 for reading. In the one it means approximately the average for 
the class, in the other it means right at the bottom of the class. But 
the children themselves and their parents are likely to assume that 
5/10 always means the same thing. 
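
The contrast between the two columns of Ex. 5 can be checked with a little arithmetic. The following is a minimal sketch in Python (a language chosen here purely for illustration); the helper names `mean_mark`, `pupils_below`, and `grades_used` are hypothetical, and the frequencies are those of Ex. 5.

```python
# Frequencies from Ex. 5, keyed by mark (0 to 10).
READING = {10: 24, 9: 25, 8: 13, 7: 6, 6: 4, 5: 2, 4: 0, 3: 0, 2: 0, 1: 0, 0: 0}
GEOGRAPHY = {10: 2, 9: 3, 8: 8, 7: 10, 6: 11, 5: 9, 4: 9, 3: 7, 2: 10, 1: 4, 0: 1}


def mean_mark(freq):
    """Average mark of a frequency table {mark: number of pupils}."""
    n = sum(freq.values())
    return sum(mark * f for mark, f in freq.items()) / n


def pupils_below(freq, mark):
    """Number of pupils whose mark is lower than `mark`."""
    return sum(f for m, f in freq.items() if m < mark)


def grades_used(freq):
    """Number of distinct marks that were actually awarded to someone."""
    return sum(1 for f in freq.values() if f > 0)


if __name__ == "__main__":
    for name, freq in (("Reading", READING), ("Geography", GEOGRAPHY)):
        print(name, round(mean_mark(freq), 2), pupils_below(freq, 5), grades_used(freq))
```

On these figures the geography mean is close to 5 and 31 of the 74 pupils fall below a mark of 5, whereas in reading nobody falls below 5 and only six distinct marks are in use.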

Fig. 17 compares the distributions of percentage marks obtained 
by a large group of students on a new-type examination and on 
an ordinary examination of the essay-type. The former, though 
somewhat irregular, tends towards the normal, symmetrical shape, 
and includes marks ranging from 19% to 86%. The latter is skewed, 
and it ignores the bottom half of the available scale of marks, since 
it ranges only from 50% to 90%. 

Fig. 17.—Distributions of percentage marks from (a) an objective examination, 
and (b) a subjectively marked examination. 

Education authorities would surely distrust a school doctor the 
weights on whose weighing machine varied, say, from ¾ lb. to 1¼ lb. 
each. Yet the units implied by the distributions which many of their 
teachers employ are probably quite as variable as this, unless some 
scheme of rationalization is applied (cf. Chap. IV). It is notorious 
also that some faculties in a university use quite different scales of 
marks, or are much more lenient than other faculties. 

It must be obvious that such marks are not true measurements. 
60% does not represent 60/100 of anything whatsoever. True, it lies a 
certain distance between 0 and 100, but as nobody can define what 
is meant either by zero or perfect ability, that fact does not help 
much. Like the qualitative grades of Type C, they are merely labels 
whose precise size is fixed more or less vaguely by convention. 
Naturally the convention varies in different institutions. 

Numerical grading, like qualitative grading, actually consists of 
ranking the scripts in relation to the marker's subjective past 
experience. It differs merely in that number labels are substituted for 
verbal ones, and that most numerical scales contain a larger number 
of different labels than do qualitative scales. This latter feature leads 
to still another common defect. When an essay or a single examina- 
tion answer has to be marked on a percentage basis, the implication 
is that the marker can distinguish, and keep in mind, a hundred 
different grades of ability (or 51 different grades if he confines himself 
to a range of 40% to 90%). Actually it has been proved by experi- 
ment that people cannot consistently discriminate more than about 
10 grades, and some would say only 5. 

THE REFORMATION OF MARKING 

The would-be reformer of marks meets with many objections and 
criticisms. Most of them either rest upon ignorance of elementary 
statistical principles, or else 'boil down' to the view that the systems 
in vogue at present are firmly established, generally understood by 
all concerned, and work too well and smoothly to be upset. While 
such a reactionary attitude is entirely unwarranted, yet there are 
admittedly many occasions in school marking where the rationality 
of the scales employed matters little, or where an eccentric distribu- 
tion may be an actual advantage. 

Some important examinations do not aim at measuring the 
abilities of all the pupils, but purely at selecting a certain proportion 
of them. For instance, the English secondary school selection exami- 
nation attempts to pick out the best 20% or so of candidates for 
grammar school education. The corresponding Scottish examination 
has roughly the opposite purpose, namely to eliminate the poorest 
pupils who are unfit for advanced secondary school work. Mowat 
(1938) points out that the most appropriate examinations or tests 
in these two instances will be ones which will spread out, or discrimi- 
nate between, candidates whose marks are close to the respective 
borderlines. The former then should yield a positively skewed 
distribution and give plenty of headroom, while the latter should 
yield a negatively skewed distribution and give plenty of footroom. 
The marks of the bottom three-quarters of candidates in the former, 
and of the top three-quarters in the latter, scarcely matter. Hence, 
it might be a waste of time to construct an examination which would 
spread them all out in the proper normal fashion.1 


1 Nevertheless, attainments tests which cover the whole range are commonly 
employed, partly because the position of the borderline varies widely from one 
area to another, and partly because it is useful to have accurate educational 
quotients to assist in allocating pupils to appropriate streams in any secondary 
school they ultimately attend. 

Next we can distinguish examinations which are constantly set in 
schools for such pedagogic purposes as ensuring that all, or prac- 
tically all, the pupils have successfully accomplished a certain stage 
in their work, or that they have learnt, say, some French grammar, 
chemical formulae, or history notes, set to them for preparation. 
Other exercises are needed, notably in mathematical subjects, for 
practice in the operations which the pupils are in process of acquiring. 
In these instances we may expect distributions to show a strong 
negative skew. The tests will naturally be designed so that the major- 
ity can get full or nearly full marks. The weaklings will be somewhat 
more spread out, but even the poorest may be able to attain half 
marks. Doubtless it is because these abnormal distributions have 
become habitual in classwork that most teachers are so blind to the 
defects in their examination marking. Such tests which are used as 
pedagogic aids should be kept quite distinct from tests or examina- 
tions designed for purposes of measuring abilities. When the object 
is to compare different pupils or to find how able a pupil may be in 
different subjects, then they should certainly be constructed and 
marked according to the principles outlined below. Further, when a 
standardized (published) test of intelligence or attainment is to be 
given to a class or to a group of students, it should always be chosen 
so as to yield a proper frequency distribution (cf. pp. 188-9). 

Rationalization of Marking.—Marking will probably never become 
scientific so long as it is confused with value judgments. Thus the 
first point to realize is that marking is a process of differentiating 
between the various levels of ability existent among the scripts which 
are being marked. Either it is a process of arranging them in rank 
order or of assigning them to one of a limited number of qualitative 
or numerical grades. But it is not a process of deciding whether all, 
or some, of the scripts are ‘good’ or ‘bad,’ whether the pupils ‘ought 
to have done better’ or ‘have tried hard,’ whether they ‘pass’ or ‘fail.’ 
All such decisions should be postponed until the actual marking is 
completed. 

Often it is desirable that the examination should be of the new- or 
objective-type. The reasons for this advice appear in a later chapter, 
but one of the main ones is that marking is greatly simplified and 
rendered far fairer. In such examinations difficulties arise, not in the 
scoring (which is purely of the count type), but in the setting. They 
should be of such a length, and contain questions of sufficient ease 
and difficulty, to differentiate among both the poorest and the best 
pupils, and to yield an approximately normal distribution. Ideally 
the average pupil in the class, or the average examinee, should get as 
near as possible to half marks, the poorest pupil little better than 
zero, and the best pupil little less than full marks. For if no examinee 
scores less than, say, 40%, then it is obvious that the easy questions 
at the beginning are a waste of everybody's time. And if no one scores 
more than, say, 70%, the most difficult questions at the end are also 
wasted. 

Humanitarian Objections.—An immediate, and quite pardonable, 
objection will be raised against this ideal, namely, that if marks range 
from nearly 0% to nearly 100%, the poorer pupils will become 
greatly discouraged when they habitually obtain marks of around 
10% to 20%, and their parents will become bitterly resentful. Some 
teachers are so conscious of the stimulating effects of success and 
the depressing effects of failure, that they deliberately distort marks, 
giving the duller pupils who have tried hard more than their work 
deserves, and the brighter ones who have not done so well as they 
might less than they deserve. It may be that this kindheartedness 
of teachers towards the dullards has been partly responsible for 
forcing up their distributions of marks into the absurd shapes 
typified by Ex. 5 above. Perhaps equally potent has been their fear 
of the criticisms of head teachers and inspectors if they admitted 
their failure to teach the almost ineducable pupils by marking their 
work on its true merits. Worthy as such humanitarian objections 
may be, they do not constitute an excuse for bad test construction or 
bad marking. But the solution is not difficult. We shall see in Chap. 
IV that the actual average, and range, or the scale of marks, employed 
do not matter in the slightest, provided that everyone agrees 
to use the same scale. Either, then, the test may contain sufficient 
(useless) questions which are easy enough to encourage the poor 
pupils; or the test may be rationally constructed, as indicated above, 
so as to yield a range of, say, 4 to 26 out of 30, or 10 to 70 out of 
80; and when the marking is complete the marks can readily be 
converted into some more acceptable and conventional scale, such 
as 40% to 90%, by the methods described in Chap. IV. A third 
alternative is to withhold all numerical grades from pupils and 
parents, since they are so liable to misinterpretation, and instead, 
once the scientific marking is done and is recorded for administra- 
tive purposes, to reconsider the scripts and attach any verbal or letter 
labels to them that appear suitable for pupil and parental consumption. 

Grading of Complex Educational Products.—Certain types of work 
such as English composition and handwriting usually have to be 
graded on the basis of total subjective impression. The same is true 
of many examination scripts when the answers consist of a series of 
historical, scientific, or other essays. Here all the answers to any one 
question should invariably be marked before the examiner proceeds 
to another question. Whenever there are about twenty or more such 
essays (or other educational products) to be dealt with, resort should 
be made to the quality or product scale method (cf. p. 154). The 
marker should first skim through a fair number of the scripts until 
he has gained a rough impression of their average level and of their 
extreme range of attainment. No account whatever should be taken 
of his feelings that this level is laudably good or deplorably bad. 
What is needed is the actual average of these particular scripts. 
Next he should look through them more carefully and pick out one 
which typifies the average, set it aside, and call it 5. Next the two 
should be taken which appear to be the best and worst of the bunch; 
these should be called either 9 and 1, or 8 and 2. Finally, others 
should be selected which seem to represent levels intermediate 
between 9 (or 8) and 5, and between 5 and 1 (or 2). These should be 
numbered (8), 7, 6, 4, 3, (2). A good deal of revision of first im- 
pressions may be required before a final decision is reached as to 
these nine (or seven) samples. Once they are chosen the marking is 
straightforward. The rest of the scripts are simply compared one by 
one with the samples and each is awarded the mark of the sample 
which it most closely resembles in merit. Should one or two turn up 
which are superior to No. 9 or inferior to No. 1, they may be given 
10 or 0. 

One obvious advantage of this plan is that it enables the marker 
to be self-consistent and to maintain the same standards throughout. 
Another is that it does not demand impossibly fine discriminations. 
The number of samples can be reduced to five, or extended to eleven, 
if desired, or some scripts may be given a mark midway between two 
samples, e.g. 6½. But usually seven or nine grades are sufficient. 
Most important, however, is the fact that the final distribution will 
be fairly smooth and symmetrical if the number of scripts is large 
and the samples have been well chosen. Even if their number is small 
there should be a majority of 6's, 5's, and 4's, and a minority of 9's, 
8's, 2's, and 1's; and there should be about as many above as below 
5. The precise marks given to the samples do not matter. They may, 
for example, equally well be called 90%, 85%, 80% . . . 55%, 50%, 
so long as the marker resolutely avoids all temptations to withhold 
90% on the irrelevant grounds that the best scripts in the bunch do 
not seem to him to be 'up to 90% standard.' It is, however, probably 
better to do the marking of each question on a 0-10 scale. Such 
marks for several questions in an examination can legitimately be 
added. The final totals can then be converted into any desired con- 
ventional scale, just as can the scores from an objective test. 

Other Types of Marking.—It is more difficult to give definite 
recommendations as to the marking of work of mixed type, where 
only part of the merit or demerit resides in points which can be 
counted. In foreign language examinations, history, science, etc., 
both correct or incorrect facts, and general style, or originality and 
the like, may need assessment. Sometimes these quantitative and 
qualitative features can be graded separately and later combined in 
due proportions. It might be better if, as is suggested in Chap. XIV, 
such papers were not set; either they should be definitely new-type, 
or definitely of the qualitative essay-type. Nevertheless, the main 
requisites of good marking should always be attainable, namely a 
similar distribution and similar average level of marks in all subjects, 
or in all the questions set in any one subject, and an approximation 
of this distribution to normality. 

Still more difficult are the problems of the marker who has only a 
few scripts with which to deal, e.g. a university external examiner 
who is called in to assess half a dozen degree candidates. Even if the 
papers consisted of perfectly constructed new-type tests, they would 
inevitably yield extremely irregular and variable distributions. The 
examiner is, therefore, forced to employ subjective recollections of 
the merits of similar candidates whom he has marked in the past. 
Nevertheless, he should keep the normal curve in mind, for in the long 
run all the scripts that he meets are likely to be normally distributed. 
He should try, therefore, to build up definite mental specifications 
of his 9, 8, 7 . . . 2, 1 levels, and attempt, by thorough discussion 
with the internal examiners, to decide where on this scale the First 
Class, Pass, or other standards fall. 

Some additional points bearing on these plans, and the answers 
to some of the possible criticisms, will be discussed when we have 
studied, in the next chapter, the statistical conceptions of percentiles, 
averages, and standard deviations. 


CHAPTER III 


STATISTICAL METHODS FOR HANDLING 
FREQUENCY DISTRIBUTIONS 


Numerical Description of Distributions.—So far we have dealt with 
various types of distributions of measures almost wholly in verbal 
terms, and have compared one distribution with another mainly 'by 
eye.' More exact study of them necessitates quantitative description 
and definition. We need first a numerical expression for the 'general 
run' of the measures, and, secondly, some index of the range or 
spread of measures. For instance, if we want to know how intelligent 
Johnny Smith is, and ask a psychologist, he will tell us Johnny's 
score on an intelligence test and, in order to enable us to interpret 
this score, he will add that the normal score for children of Johnny's 
age is so and so, and that Johnny's score falls at such and such a level 
in the distribution. Again, if we ask a teacher what are the heights 
of the children in her class, she might in reply show us a complete 
table of the distribution of heights, or she might tell us with greater 
precision that the average was 50 inches and the range from 42 to 
57 inches. 

Actually the range is not a good index of the way the measures 
are spread out above and below the average, since it depends on only 
two of the measures, those obtained by the extreme pupils who were 
measured. For example, although this class contains children ranging 
from 42 to 57 inches, all the rest might lie between 46 and 54 inches, 
and we would hardly get a fair picture of the spread if we were only 
told that the total range is 15 inches. A much better index is called 
the Standard Deviation—often abbreviated to S.D. or σ, which will 
be described below. 

A distribution of measures can then generally be defined numerically 
by means of its average and standard deviation. But we shall 
describe first an alternative system which possesses certain practical 
advantages. This is based on what are known as the median, quartiles, 
and deciles or percentiles. These terms all refer to certain levels 
or fixed points in a distribution. Suppose the measures to be tabulated 
or arranged in order, then the median is the middle measure, 
i.e. the measure half-way up counting from the bottom, or half-way 
down counting from the top. The lower and upper quartiles, usually 
referred to as $Q_1$ and $Q_3$, are the measures one-quarter and 
three-quarters of the way up. The decile measures are one-tenth, two-tenths, etc., 
and the percentiles one-hundredth, two-hundredths, etc., of the way 
up. Stated in a different way, the Xth percentile is the measure below 
which X% of all the measures fall. 


Determining the Median.—Ex. 6.—Consider the following distribution 
of only five measures. Clearly the median is 15, that is the 
third measure up from the bottom or down from the top. 


From this example it may be seen that the median is the $\frac{N+1}{2}$th 
measure, not the $\frac{N}{2}$th. 


Ex. 7.—In the following distributions, N = 8, hence $\frac{N+1}{2} = 4\frac{1}{2}$. 

[First distribution: not legible in the source.] 

Second     Third 

  10         19 
   9         17 
   8         14 
   7         12 
   6          9 
   5          5 
   4          4 
   3          1 


Thus the median is half-way between the fourth and fifth measures 
from the bottom. In the first distribution the fourth and fifth measures 
are both 7, hence the median is 7. In the second, the fourth and 
fifth measures are 6 and 7, hence we have to take 6½ as the median. 
In the third distribution the data are too sparse for an accurate 
median to be determined. The best we can do is to interpolate and 
call it 10½. 

Determining the Quartiles and Percentiles.—The quartiles are 
similarly obtained by counting the measures which are $\frac{N+1}{4}$ and 
$\frac{3(N+1)}{4}$ from the bottom. For the distribution given in Ex. 1 (p. 5), 
N = 50, hence the required measures are 12¾ and 38¼ up the list. 
The 11th, 12th, and 13th measures are 9, the 37th to 43rd measures 
are all 17, hence $Q_1 = 9$, $Q_3 = 17$. 
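
The counting rule just described can be put into a few lines of code. The sketch below is only an illustration in Python, assuming linear interpolation between neighbouring measures whenever the required position is fractional (as the text does for 6½ and 10½); the names `measure_at`, `median`, and `quartiles` are hypothetical.

```python
def measure_at(ordered, position):
    """Value at a 1-based position counted from the bottom of an ordered list.

    A fractional position (e.g. 4.5) is resolved by interpolating linearly
    between the two neighbouring measures.
    """
    if position < 1 or position > len(ordered):
        raise ValueError("position lies outside the list of measures")
    lower = int(position)               # whole part of the position
    fraction = position - lower         # fractional part, 0 <= fraction < 1
    value = ordered[lower - 1]
    if fraction > 0:
        value += fraction * (ordered[lower] - value)
    return value


def median(measures):
    """The (N + 1)/2 th measure from the bottom."""
    ordered = sorted(measures)
    return measure_at(ordered, (len(ordered) + 1) / 2)


def quartiles(measures):
    """The (N + 1)/4 th and 3(N + 1)/4 th measures from the bottom."""
    ordered = sorted(measures)
    n = len(ordered)
    return measure_at(ordered, (n + 1) / 4), measure_at(ordered, 3 * (n + 1) / 4)


if __name__ == "__main__":
    print(median([3, 4, 5, 6, 7, 8, 9, 10]))       # second distribution of Ex. 7
    print(median([1, 4, 5, 9, 12, 14, 17, 19]))    # third distribution of Ex. 7
```

The position guard means that very small groups (fewer than three measures) will raise an error for the quartiles, which seems preferable to returning a misleading figure.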

Of the percentiles, the most useful are the 100th or 99th, 90th, 
75th, 50th, 25th, 10th, 1st, or 0th. The 100th and 0th are usually 
the top and bottom measures in the distribution, respectively. The 
75th and 25th are the quartiles, the 50th is the median. The 90th 
percentile or 9th decile is the measure which is $\frac{9(N+1)}{10}$ from the 
bottom. 

Determining Percentiles from Tabulated Data.—When dealing with 
measures which have been grouped into classes, these various levels 
are likely to fall somewhere in the middle of a class. For instance, in 
Ex. 3 (p. 10), the median of 250 measures is the 125½th, which lies 
somewhere within the class 98-102. Its position is found by interpolation, 
and the following formula may be used. 


$$X_P = L + \frac{C}{F}\left(\frac{PN}{100} - S\right)$$ 


P = the percentile. 
$X_P$ = the measure falling at this percentile level, which we wish to 
find. 
L = the lower limit of the class, somewhere within which $X_P$ lies. 
S = the sum of all the frequencies up to, but not including, this 
class. 
F = the frequency within this class. 
N = the total of all the frequencies. 
C = the size of the class. 


Ex. 8.—To find the median of the distribution given in Ex. 3, 
substitute in this formula: 
P = 50. L, the lower limit, = 98. S = 3 + 10 + 18 + 27 + 43 = 101. 
F = 47. N = 250. C = 5. 


$$X_{50} = 98 + \frac{5}{47}\left(\frac{50 \times 250}{100} - 101\right) = 100.5$$ 
(to the first place of decimals). 

By applying the same formula, we find the 99th, 90th, 75th, 25th, 
10th, and 1st percentiles to be 128, 117, 109, 94, 86, and 77 respectively 
(omitting decimals). 
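
The interpolation formula lends itself to a short routine. What follows is a sketch under stated assumptions rather than a definitive implementation: it is written in Python, the function name `grouped_percentile` is hypothetical, and the class frequencies for Ex. 3 are taken from the tabulation used later in Ex. 12. The variables `target`, `below`, `lower`, and `f` stand for PN/100, S, L, and F respectively.

```python
# Classes of Ex. 3 as (lower limit L, frequency F), in ascending order.
EX3_CLASSES = [
    (73, 3), (78, 10), (83, 18), (88, 27), (93, 43), (98, 47), (103, 35),
    (108, 26), (113, 22), (118, 11), (123, 6), (128, 1), (133, 1),
]


def grouped_percentile(p, classes, class_size):
    """Percentile of a grouped distribution: L + (C / F) * (P * N / 100 - S)."""
    n = sum(f for _, f in classes)
    target = p * n / 100            # PN/100: how far up the list we must count
    below = 0                       # S: frequencies in the classes already passed
    for lower, f in classes:
        if f > 0 and below + f >= target:
            return lower + class_size / f * (target - below)
        below += f
    raise ValueError("p must lie between 0 and 100")


if __name__ == "__main__":
    for p in (99, 90, 75, 50, 25, 10, 1):
        print(p, round(grouped_percentile(p, EX3_CLASSES, 5), 2))
```

For P = 50 the routine gives about 100.55, which the text quotes as 100.5; the other levels agree with the figures above once they are rounded to whole numbers.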

Comparison of Median and Average.—If the distribution is a 
perfectly smooth and normal one, the median is identical with the 
average. In any fairly regular and symmetrical distribution they will 
be quite close to one another. Thus in Ex. 8, where the median is 
100.5, the average happens to be 100.68. For rough work the median 
is often used as an approximation to the average, since its computation 
is far easier.1 Sometimes, indeed, it gives a better notion of the 
‘general run’ of the measures than does the average. For example, 
the average age of a school class may be unduly raised because the 
class includes two or three pupils much older than all the rest. They 
cannot very well be omitted in calculating the average, but if the 
median is found, they no longer exert a disproportionate effect upon 
the result. Similarly, if a free word association test is applied to a 
child at a psychological clinic, most of the child's responses may be 
given within two to three seconds, but some of them may take much 
longer, and a few words may evoke no response at all. In such a case 
the average cannot be determined, and the median is the most 
reasonable measure of the child's speed of response. 

Here is another instance where the median may be very useful. 
Suppose an investigator wishes to find the approximate average 
speed with which a group of pupils or students can perform a certain 
task, such as handwriting. There is no need for him to time each 
person separately, nor to count the number of words each one has 
written in a certain time, and then add up and average the results. 
Instead, he may set the whole class writing a standard passage, and 
tell them to hold up their hands directly they finish. Then he may 
count the hands raised, and stop the test as soon as the median 
person raises his hand, noting the time at which this occurs. The 
median speed of writing for a class of, say, 41 persons can be calcu- 
lated from the time taken by the 21st quickest person. 

When a distribution is strongly skewed, the median differs 
appreciably from the average, being larger if the skew is negative, 
smaller if it is positive. In Ex. 1 (p. 14), the median is 14, the average 
12.86. This difference is often used as a measure of skewness (the 
formula can be found in more advanced textbooks). 

Uses of Percentiles.—The percentile levels in a distribution clearly 
provide us with as complete a numerical description of that distri- 
bution as we may need. For example, the figures quoted in Ex. 8 for the 
distribution of students’ scores on the Cattell Intelligence Test (the 
99th, 90th, 75th, 50th, 25th, 10th, and 1st percentiles) give a reasonably 
full answer to the question—how did the students do on the 
test? Knowing these levels, we can begin to interpret the significance 
of any one person's score, since we can tell how he stands relative to 
this group of students. Thus a score of 92 on the Cattell test would 
not be a very good one for a training college student. It falls at about 
the 20th percentile, which means that 80% of the students did better 
at the test, only 20% did worse. But when we are told that the same 
score falls at the 94th percentile for normal adults (not students), we 
realize that it does represent quite high ability. The applications of 
percentiles to the interpretation of school marks and to test norms 
will be considered later. 


1 The main disadvantage of medians, and also of percentiles, is that they cannot 
readily be manipulated statistically. For instance, given the medians of distributions 
A and B, we cannot determine from them the median of the combined 
distribution, A + B; whereas the averages of A and of B will give us the average 
of A + B. 

Inter-quartile Range.—The range of scores from the 25th to the 
75th percentiles ($Q_1$ to $Q_3$) is often called the inter-quartile range. 
It tells us what the middle 50% of the group accomplished. Half of 
this range is often referred to as Q, the semi-interquartile range, and 
this is used as a measure of the degree of spread or dispersion of the 
distribution. When the distribution is smooth and normal, Q is 
usually equal to about one-eighth of the total range of measures. 
In Ex. 3 (p. 10) the total range is 61, and $Q = \frac{109 - 94}{2} = \frac{15}{2} = 7.5$, 
which is one-eighth of 60. 


CALCULATING THE AVERAGE OR MEAN 
More accurate statistical treatment of distributions demands the 
use of the average and standard deviation in place of the median 
and Q. Everyone knows how to calculate an average or arithmetic 
mean—often referred to as 'the mean'—but most people use an 
unnecessarily complex method, which is very liable to lead to mistakes. 
We will therefore outline this and the shorter alternative 
methods. 


(1) The ordinary method consists in adding up all the measures 
and dividing by the number of cases. Expressed as an algebraic 
formula, $M = \frac{\Sigma X}{N}$. M = the mean or average, X = any one measure, 
and Σ = the sum of all the . . . , hence ΣX signifies the sum 
of all the separate measures. As a rough general rule, this method 
may be used when N does not amount to more than about 20. With 
larger numbers it becomes very inefficient, unless an adding machine 
is available. 

(2) Often it is quicker to tabulate the measures, to multiply each 
X by its F, and then to sum. The formula now becomes $M = \frac{\Sigma(FX)}{N}$. 


Ex. 9.—The method is here applied to the distribution already 
given in Ex. 1. 


  X      F      FX 

 20      2      40 
 19      2      38 
 18      3      54 
 17      7     119 
 16      4      64 
 15      6      90 
 14      3      42 
 13      4      52 
 12      2      24 
 11      2      22 
 10      2      20 
  9      3      27 
  8      2      16 
  7      1       7 
  6      2      12 
  5      1       5 
  4      2       8 
  3      0       0 
  2      1       2 
  1      1       1 

     N = 50     643 

$$M = \frac{643}{50} = 12.86$$ 


(3) The process is shortened still further if the measures are 
tabulated in classes. If x is the midpoint of each class, then 
$M = \frac{\Sigma(Fx)}{N}$. 

Ex. 10.—The method is here applied to the distribution given in 
Ex. 1, grouped into classes. 

  X        x      F      Fx 

18-20     19      7     133 
15-17     16     17     272 
12-14     13      9     117 
 9-11     10      7      70 
 6-8       7      5      35 
 3-5       4      3      12 
 0-2       1      2       2 

             N = 50     641 

$$\Sigma(Fx) = 641, \qquad M = \frac{641}{50} = 12.82$$ 


Note that this value of M, 12.82, is not precisely identical with 
that obtained by the previous method, 12.86, though the discrepancy 
is only 0.3%. The reason is, of course, that the figures are somewhat 
distorted by being grouped into classes. We are regarding the 20's 
and the 18's as 19's, the 17's and 15's as 16's, and so on. In the long 
run these distortions tend to cancel one another out, and therefore 
have very little effect upon the final result. They only become serious 
if the grouping is too coarse, i.e. if the number of classes in the table 
is too small. 
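
The effect of grouping can be seen by computing the mean from class midpoints and comparing it with the ungrouped value. The sketch below is illustrative only, in Python, with the hypothetical helper name `grouped_mean`; the classes are those of Ex. 10 and the figure 12.86 is the ungrouped mean from Ex. 9.

```python
# Ex. 10 as (lowest measure in class, highest measure in class, frequency F).
EX10_CLASSES = [
    (18, 20, 7), (15, 17, 17), (12, 14, 9), (9, 11, 7),
    (6, 8, 5), (3, 5, 3), (0, 2, 2),
]


def grouped_mean(classes):
    """M = sum(F * x) / N, where x is the midpoint of each class."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (low + high) / 2 for low, high, f in classes)
    return total / n


if __name__ == "__main__":
    grouped = grouped_mean(EX10_CLASSES)
    ungrouped = 12.86                      # from Ex. 9
    # Relative discrepancy introduced by grouping, as a percentage.
    print(grouped, 100 * (ungrouped - grouped) / ungrouped)
```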

(4) The amount of computation, when large numbers are involved, 
may be much reduced by applying the device of the arbitrary or 
guessed mean. The following illustration may explain it. Suppose we 
wish to find the average height of a class of children, we could first 
of all measure each child's height from the ground to the top of his 
head, add up the figures and divide by N. An alternative plan, which 
would do equally well, would be to place each child by a 3-foot table, 
and measure the height from this to the top of his head. We would 
average the measures as before, but would add 36 inches to our 
final result. Again, we might take a still higher table, say 50 inches, 
and measure from this up to the tops of the heads of the larger 
children, down to the heads of the smaller ones. If we added together 
the former, subtracted the latter, divided by N, and finally added 
50 inches, we should, of course, get exactly the same result as before. 
But the figures with which we should be dealing would be far smaller 
than in the first instance. We would be averaging, e.g. +6, +1, 
−2, +3, −5, etc., instead of 56, 51, 48, 53, 45, etc. 

The method consists, then, in making a rough guess at the mean, 
listing the amounts by which each measure deviates from (is greater 
or less than) this arbitrary mean, averaging the deviations, and 
finally adding the average to the arbitrary mean. 


$$M = \frac{\Sigma(FD)}{N} + A$$ 


A is the arbitrary or guessed mean, and D the difference between 
each X and A. 


Ex. 11.—The method is here applied to the distribution given in 
Ex. 9 (p. 41), and it yields precisely the same result for M as was 
obtained by the ordinary method of averaging. 

  X       D      F      FD 

 20      +7      2      14 
 19      +6      2      12 
 18      +5      3      15 
 17      +4      7      28 
 16      +3      4      12 
 15      +2      6      12 
 14      +1      3       3 
 13       0      4            + 96 
 12      −1      2       2 
 11      −2      2       4 
 10      −3      2       6 
  9      −4      3      12 
  8      −5      2      10 
  7      −6      1       6 
  6      −7      2      14 
  5      −8      1       8 
  4      −9      2      18 
  3     −10      0       0 
  2     −11      1      11 
  1     −12      1      12 

                50            − 103 


$$A = 13. \qquad M = \frac{96 - 103}{50} + 13 = -0.14 + 13 = 12.86$$ 


Note that in this example $\frac{\Sigma(FD)}{N}$ is a minus quantity, and is therefore 
subtracted from A. If we had chosen A = 12, $\frac{\Sigma(FD)}{N}$ would have been a plus 
quantity, namely +0.86. 
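
The guessed-mean device can be sketched in code as follows. This is an illustration in Python only; `mean_by_guessed_mean` and `correction` are hypothetical names, and the data are those of Ex. 11. Whatever guess A is chosen, the final mean should come out the same; only the sign and size of the correction change.

```python
# Ex. 11 (the distribution of Ex. 9) as (measure X, frequency F) pairs.
EX11 = [
    (20, 2), (19, 2), (18, 3), (17, 7), (16, 4), (15, 6), (14, 3), (13, 4),
    (12, 2), (11, 2), (10, 2), (9, 3), (8, 2), (7, 1), (6, 2), (5, 1),
    (4, 2), (3, 0), (2, 1), (1, 1),
]


def correction(table, guess):
    """sum(F * D) / N, where D = X - A is the deviation from the guessed mean A."""
    n = sum(f for _, f in table)
    return sum(f * (x - guess) for x, f in table) / n


def mean_by_guessed_mean(table, guess):
    """M = sum(F * D) / N + A."""
    return guess + correction(table, guess)


if __name__ == "__main__":
    for a in (13, 12):
        print(a, correction(EX11, a), mean_by_guessed_mean(EX11, a))
```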
(5) By far the most efficient method when N is large, say 100 or 
more, is to average, not the actual amounts, but the numbers of 
classes by which the tabulated measures exceed or fall short of A. 
For instance, suppose we have a table of the distribution of ages of 
all the inhabitants of a town, grouped into classes of ten years, with 
midpoints of the classes at 5, 15, 25, etc., years. Wishing to calculate 
the average age we take 35 as our arbitrary mean. Obviously there is 
no need to sum the numbers who are 10, 20, 30, . . . years more or less 
than this A. We might just as well sum those who are 1, 2, 3, . . . 
decades more or less. Then, when we have calculated the average 
deviation from A in decades, we must multiply it by 10 to turn it 
back into years, before adding it to, or subtracting it from, A. 

The formula for this method is $M = \frac{\Sigma(Fd) \times c}{N} + A$, where d is 
the number of the class above or below that class of which A is the 
midpoint, and c is the size of each class. 


Ex. 12.—The method is here applied to the distribution given in
Ex. 3 (p. 10).

       X         d      F     Fd
    133-137    + 6      1      6
    128-132    + 5      1      5
    123-127    + 4      6     24
    118-122    + 3     11     33
    113-117    + 2     22     44
    108-112    + 1     26     26
    103-107      0     35            + 138
     98-102    − 1     47     47
     93-97     − 2     43     86
     88-92     − 3     27     81
     83-87     − 4     18     72
     78-82     − 5     10     50
     73-77     − 6      3     18

                   N = 250           − 354

The arbitrary mean is taken as the midpoint of the 103-107 class,
i.e. as 105. Classes above and below this are numbered consecutively.

$$\frac{\sum (Fd)}{N} = \frac{138 - 354}{250} = -0.864$$

This is the average number of classes by which M differs from A.
We must multiply by 5 to get the average deviation in scores, since
c = 5.

$$M = 105 - 5 \times 0.864 = 100.68$$
Once the student is accustomed to this method he will find it far 
quicker and more accurate than any other. If in this example he had 
used one of the first three methods, he would have been involved in 
a sum totalling about 25,000. Points which require special caution
are: (i) making sure what measures are included in a class, and what
is the size of c; (ii) determining A correctly; (iii) multiplying Σ(Fd)/N
by c before adding it to A; (iv) making sure of the sign of this
quantity before adding it to, or subtracting it from, A.
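The class-deviation method of Ex. 12, together with the four cautions just listed, might be sketched as follows. The function and variable names are assumptions made for illustration; the frequencies are those of the table in Ex. 12.

```python
def grouped_mean(freqs, steps, class_size, assumed):
    """M = c * sum(F*d)/N + A, with d counted in whole classes from A's class."""
    n = sum(freqs)
    mean_step = sum(f * d for f, d in zip(freqs, steps)) / n  # in classes, signed
    return assumed + class_size * mean_step                   # back to score units

# Ex. 12: classes 133-137 down to 73-77, c = 5, A = 105 (midpoint of 103-107).
freqs = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
steps = list(range(6, -7, -1))  # d = +6, +5, ..., 0, ..., -6

m = grouped_mean(freqs, steps, class_size=5, assumed=105)
print(round(m, 2))
```

Keeping d signed throughout means that caution (iv), the sign of the correction, is handled by the arithmetic rather than by inspection.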


THE MEAN VARIATION 
The Mean Variation or Average Deviation (M.V. or A.D.) is
occasionally employed as an index of the extent to which the measures
are spread out on either side of the mean. As its name indicates,
it is the average of all the differences between each measure and the
mean (or sometimes the median), regardless of whether these differences
are plus or minus.

Ex. 13.—In this very brief distribution, the mean is 5.6. The D
column lists the deviations. M.V. = ΣD/N = 10.4/5 = 2.08. If the data
were in the form of a frequency distribution table, the same method
would apply, but the formula would be M.V. = Σ(FD)/N.

       X        D
       9       3.4
       7       1.4
       6       0.4
       4       1.6
       2       3.6
    N = 5     10.4
    M = 5.6

Clearly the M.V. will be larger the more widely dispersed or
spread out the measures. Its value is a little greater than that of Q,
being generally about one-seventh of the total range of measures.
(This does not hold in Ex. 13 owing to the shortness and irregularity
of the distribution.) The M.V. happens to be of very little use for
purposes of further statistical treatment, hence it is usually passed
over in favour of the Standard Deviation.
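As a check on Ex. 13, the Mean Variation can be computed directly. The sketch below assumes ungrouped measures, and the name `mean_variation` is chosen only for illustration.

```python
def mean_variation(values):
    """Average of the absolute deviations of the measures from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

measures = [9, 7, 6, 4, 2]          # the short distribution of Ex. 13
mv = mean_variation(measures)
print(round(mv, 2))
```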


THE STANDARD DEVIATION 

The S.D. is also a kind of average of all the deviations from the
mean, but it differs from the M.V. in that the deviations are squared
before they are summed, and the square root of their average is
taken.

$$\text{S.D. or } \sigma = \sqrt{\frac{\sum D^{2}}{N}}$$
Ex. 14.—The working is here shown for the same short distribution
as in Ex. 13.

       X        D        D²
       9       3.4     11.56
       7       1.4      1.96
       6       0.4      0.16
       4       1.6      2.56
       2       3.6     12.96
    N = 5              29.20
    M = 5.6

$$\frac{\sum D^{2}}{N} = \frac{29.20}{5} = 5.84; \qquad \sigma = 2.42$$
As the mean is seldom a whole number, the squaring of deviations
from it is troublesome. Generally it is simpler to list the deviations
from some arbitrary mean which is a whole number, and then to
apply a correction according to the following formula:

$$\sigma = \sqrt{\frac{\sum D^{2}}{N} - (M - A)^{2}}$$

Note that the correction is invariably subtracted, whether its sign
before squaring is positive or negative.

Ex. 15.—In the same distribution, let A = 6, i.e. the nearest whole
number to M.

       X       D       D²
       9       3        9
       7       1        1
       6       0        0
       4       2        4
       2       4       16
                       30

$$\frac{\sum D^{2}}{N} = \frac{30}{5} = 6.0; \qquad (M - A)^{2} = (5.6 - 6)^{2} = 0.16$$

$$\sigma^{2} = 6.0 - 0.16 = 5.84; \qquad \sigma = 2.42$$
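The correction formula can be verified numerically: whatever arbitrary mean A is chosen, the corrected result should equal the S.D. taken about the true mean. The following sketch uses hypothetical names and the five measures of Ex. 13 to 15.

```python
import math

def sd_from_origin(values, origin):
    """S.D. via sqrt(sum(D^2)/N - (M - A)^2), with D measured from A = origin."""
    n = len(values)
    mean = sum(values) / n
    mean_sq_dev = sum((x - origin) ** 2 for x in values) / n
    return math.sqrt(mean_sq_dev - (mean - origin) ** 2)

measures = [9, 7, 6, 4, 2]
sd_true_mean = sd_from_origin(measures, 5.6)  # correction term is zero
sd_a6 = sd_from_origin(measures, 6)           # A = 6
sd_a0 = sd_from_origin(measures, 0)           # A = 0, i.e. D is the raw measure
print(round(sd_true_mean, 2), round(sd_a6, 2), round(sd_a0, 2))
```

Because the correction is a square, it is subtracted whether M lies above or below A.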


The same formula may be extended to several different types of 
distributions. 

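(a) When the distribution is a short one, and all the measures are
whole numbers, A may be chosen at zero. The D's will then be the
original X's, so that:

$$\sigma = \sqrt{\frac{\sum X^{2}}{N} - M^{2}}$$

Ex. 16.—

       X       X²
       9       81
       7       49
       6       36
       4       16
       2        4
              186

$$\frac{\sum X^{2}}{N} = \frac{186}{5} = 37.2; \qquad M^{2} = 31.36$$

$$\sigma^{2} = 37.2 - 31.36 = 5.84; \qquad \sigma = 2.42$$
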
This method is normally applied when a calculating machine is
available, since the machine is not (like the human mind) liable to
errors when coping with large numbers.

(b) Often the mean is not known, but is calculated at the same
time as the S.D. Now M = ΣD/N + A. Hence an alternative version
of the same formula is:

$$\sigma = \sqrt{\frac{\sum D^{2}}{N} - \left(\frac{\sum D}{N}\right)^{2}}$$

(c) When the measures are grouped into classes, the formula
becomes:

$$\sigma = c\,\sqrt{\frac{\sum (Fd^{2})}{N} - \left(\frac{\sum (Fd)}{N}\right)^{2}}$$

Here c is the size of each class, and d the deviation of each measure 
from an arbitrary or guessed mean in terms of number of classes. 


Ex. 17.—The formula is here applied to the distribution whose
mean was calculated in Ex. 12. It was found there that Σ(Fd)/N =
− 0.864. Hence the correction (Σ(Fd)/N)² = 0.746.

       X       F       d      d²     Fd²
     133+      1     + 6     36      36
     128+      1     + 5     25      25
     123+      6     + 4     16      96
     118+     11     + 3      9      99
     113+     22     + 2      4      88
     108+     26     + 1      1      26
     103+     35       0      0       0
      98+     47     − 1      1      47
      93+     43     − 2      4     172
      88+     27     − 3      9     243
      83+     18     − 4     16     288
      78+     10     − 5     25     250
      73+      3     − 6     36     108

             250                  1,478

$$\frac{\sum (Fd^{2})}{N} = \frac{1478}{250} = 5.912$$

$$\sigma = 5\sqrt{5.912 - 0.746} = 11.36$$


One more instance will be given, to illustrate the calculation of the
mean and S.D. in one operation.

Ex. 18.—The following distribution of school marks is grouped
into classes of two. Take A at the 11-12 class, i.e. at 11.5.

       X        F       d      Fd     Fd²
     19-20      2     + 4       8      32
     17-18      5     + 3      15      45
     15-16      4     + 2       8      16
     13-14      4     + 1       4       4
     11-12      3       0    + 35       0
      9-10      7     − 1       7       7
      7-8       4     − 2       8      16
      5-6       4     − 3      12      36
      3-4       2     − 4       8      32
      1-2       1     − 5       5      25

           N = 36            − 40     213

$$\frac{\sum (Fd)}{N} = \frac{35 - 40}{36} = -\frac{5}{36} = -0.139$$

This is in terms of classes. In terms of marks, c × Σ(Fd)/N = − 0.278.
Thus M = 11.5 − 0.278 = 11.22.

$$\frac{\sum (Fd^{2})}{N} = \frac{213}{36} = 5.917; \qquad \left(\frac{\sum (Fd)}{N}\right)^{2} = 0.019$$

$$\sigma = 2\sqrt{5.917 - 0.019} = 4.858$$
These calculations are quite quickly and easily accomplished with 
a little practice, though some care is needed in checking the
arithmetic. For example, the whole of the above calculation, using
4-figure logarithms and checking by slide-rule, occupied the writer less
than four minutes.
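Formula (c) and the mean formula of method (5) can be combined into a single pass over a grouped table, as in Exs. 17 and 18. The sketch below is illustrative only; the names are assumptions, and for Ex. 18 the 11-12 class is given a frequency of 3, the value at which the frequencies total the stated N = 36.

```python
import math

def grouped_mean_sd(freqs, steps, class_size, assumed):
    """Return (M, S.D.) from class frequencies F and class deviations d."""
    n = sum(freqs)
    mean_d = sum(f * d for f, d in zip(freqs, steps)) / n       # sum(Fd)/N
    mean_d2 = sum(f * d * d for f, d in zip(freqs, steps)) / n  # sum(Fd^2)/N
    mean = assumed + class_size * mean_d
    sd = class_size * math.sqrt(mean_d2 - mean_d ** 2)
    return mean, sd

# Ex. 17: c = 5, A = 105.
ex17 = grouped_mean_sd([1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3],
                       list(range(6, -7, -1)), 5, 105)

# Ex. 18: c = 2, A = 11.5 (frequency of the 11-12 class assumed to be 3).
ex18 = grouped_mean_sd([2, 5, 4, 4, 3, 7, 4, 4, 2, 1],
                       list(range(4, -6, -1)), 2, 11.5)

print([round(v, 2) for v in ex17], [round(v, 2) for v in ex18])
```

Small differences in the third decimal place from the worked figures are to be expected, since the text rounds the intermediate quantities.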
The S.D. is always greater than Q or the M.V. This is to be expected,
since any big deviations from the mean receive much greater
weight in the S.D. than in the M.V. by being squared before they are
averaged. It generally amounts to one-fifth or one-sixth of the total
range, or sometimes more if the distribution is irregular, as in Ex. 18.
This rule enables us to apply a rough check to our result. If in Ex. 18
we had obtained σ = 2 or σ = 10, we should have known that there
had been some mistake in the computation.


METHODS FOR FREQUENCY DISTRIBUTIONS 49 


The Significance of the S.D.—It was shown earlier in this chapter 
that any distribution of scores could be fairly completely defined or 
numerically described provided we knew some of the main percentile 
levels. Now if a distribution approximates to the normal shape (as it
should do when obtained from a well-constructed test or examination),
it can be more simply specified merely by finding the mean score and
the standard deviation. The reason for this is that the normal curve
is a regular shape which, like the straight line or the parabola, has
its own algebraic equation. For instance: y = kx gives a straight line,
running through the origin, whose slope depends on the size of k. A
normal curve's equation is more complicated, but actually involves
nothing more than x, y, σ and certain constants:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-x^{2}/2\sigma^{2}}$$

x refers to any given score expressed as a deviation from the mean, and
y is the frequency of that score, expressed as a proportion of the total
number of cases.

This means that, if we know the S.D. of a distribution, we can 
deduce y for every value of x, that is the frequency of every possible 
measure. For instance, although we cannot measure the intelligence 
quotients of the whole population of Great Britain, yet, if we may 
assume that the I.Q. is a variable which is normally distributed, with 
a mean of 100 and a S.D. of 15, we can state at once just what
proportions of the population have I.Q.s of 100, 101, . . . etc., or what
proportion lies below 67 I.Q., or between 120 and 130 I.Q., and so on.

Furthermore, there are available in most statistical textbooks tables
which will tell us the value of y corresponding to each value of x/σ,
regardless of what the variable may be (I.Q., height, handwriting speed, etc.).
These are known as 'probability integral' tables. The graph given in
Fig. 18 is based on such tables, but is probably easier to use, and is
accurate to two or three significant places. Against x/σ is plotted, not y
(the actual proportion of measures which deviate from the mean
by this amount) but the proportion which deviate by this amount or
more. Consider for example I.Q. 115. Here x is + 15, and if the S.D.
is 15, then x/σ = 1. That is, 115 represents a deviation of 1 × σ above
the mean. Now the lowest curve in Fig. 18 tells us that when x/σ = 1,
y = 0.16. In other words, about 0.16 or 16% of the population
would be expected to have I.Q.s of 115 upwards. Similarly about
16% have I.Q.s of 85 and below, since 85 represents a deviation of
− 1 × σ from the mean. If we enquire how many I.Q.s there are of 130
upwards, x/σ = + 2, and the second curve gives the answer 0.023, or
about 2.3%. Note that the same curves can be used for making
deductions about any other normally distributed variable, provided
allowance is made for the size of the S.D. The upper curves are for
dealing more accurately with larger values of x/σ and smaller values of
y. Each curve is, as it were, a 10 times magnification of a section of
the curve below.


Ex. 19.—In a certain area, with a school population of 25,000, an
intelligence test with a S.D. of 16½ was applied. How many children
may be expected to have I.Q.s of 120 to 130? And what will be the
highest I.Q. among the least intelligent 1,000 of these children?

Intelligence is something which varies continuously; that is to say,
every possible value of I.Q. in between 119 and 120 can occur,
although we usually calculate it only to the nearest whole number.
Strictly speaking, then, we want to know how many children have
I.Q.s between 119.50 and 130.50, and to exclude those whose I.Q.s
are 119.49 or less (who would count as 119's), also those whose
I.Q.s are 130.51 or more (who would count as 131's). Now 119.5
and 130.5 differ from the mean by 19.5 and 30.5, that is by 1.18σ and
1.85σ, taking σ as 16½. Reading off from Fig. 18, the corresponding
proportions are 0.119 and 0.0322. Hence the proportion of the
population within the given limits is 0.119 − 0.0322 = 0.0868 =
8.68%. 8.68% of 25,000 is 2,170, the answer to our first question.

In the second question, 1,000 out of 25,000 is a proportion of
0.040. From Fig. 18, 0.04 corresponds to an x value of 1.75σ.
1.75 × 16½ = 28.875. The lowest thousand therefore have I.Q.s of
100 − 28.875, or less. Hence the answer, to the nearest whole
number of I.Q., is 71.
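Where the graph of Fig. 18 is not to hand, the same proportions can be obtained from the normal curve by computation. The following is a minimal sketch of Ex. 19 using only the Python standard library; the function name `upper_tail` is an assumption of this example, and the S.D. is taken as 16.5.

```python
import math
from statistics import NormalDist

def upper_tail(z):
    """Proportion of a normal distribution lying z S.D.s or more above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mean, sd, population = 100, 16.5, 25000

# First question: I.Q.s of 120 to 130, i.e. between 119.5 and 130.5.
proportion = upper_tail((119.5 - mean) / sd) - upper_tail((130.5 - mean) / sd)
children = proportion * population

# Second question: the I.Q. below which the lowest 1,000 (a proportion of 0.04) fall.
cutoff = NormalDist(mean, sd).inv_cdf(1000 / population)

print(round(children), round(cutoff))
```

The count obtained this way differs slightly from the 2,170 of the worked example, since the graph is read only to two decimal places of x/σ; the cut-off agrees with the answer of 71.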

Ex. 20.—Does the distribution of intelligence test scores given in
Ex. 17 (p. 47) conform to a normal distribution? In other words, are
the frequencies obtained similar to the frequencies which would be
expected for a truly normal distribution with the same mean (100.68)
and the same S.D. (11.36)?

Test scores, unlike I.Q.s, are discrete. Nobody could obtain a
fractional score, however carefully his score was computed. If he nearly, but
not quite, completed 108 items, his score would still be 107. Thus
in order to find what proportion may be expected (according to the
normal curve) to fall within the 103-107 class, we must determine
the proportion with scores of 103 or more and subtract from it the
proportion with scores of 108 or more, not the proportions lying
between 102.5 and 107.5. The calculations for all classes are given
in the following table. For example, 103 is a deviation of + 2.32
from the mean, or + 0.20σ; 108 deviates + 7.32 or + 0.65σ from
the mean. From Fig. 18 we find the proportions lying at or beyond
these two limits to be 0.421 and 0.258 respectively. Hence the
proportion to be expected within the 103-107 class is 0.421 − 0.258 =
0.163. N is 250, and 0.163 × 250 = 41. The penultimate column
lists the expected frequencies (omitting decimals), and the last
column gives the frequencies actually obtained. The two columns
agree fairly closely, but there are some discrepancies. At present we
lack a method for telling whether these discrepancies should be
regarded as appreciable or important, but such a method will be
described later (pp. 90-1). The full answer to this question will
therefore be postponed.

                              Proportions    Proportions
                              lying at or    expected     F expected   F actually
       X         x      x/σ   beyond these   in each      in each      obtained
                              values of x/σ  class        class

     133+    + 32.32  + 2.85    0.00219       0.00219         1            1
     128+    + 27.32  + 2.41    0.0078        0.00561         1            1
     123+    + 22.32  + 1.97    0.025         0.0172          4            6
     118+    + 17.32  + 1.53    0.062         0.037           9           11
     113+    + 12.32  + 1.09    0.138         0.076          19           22
     108+    +  7.32  + 0.65    0.258         0.120          30           26
     103+    +  2.32  + 0.20    0.421         0.163          41           35
      98+    −  2.68  − 0.24    0.404         0.175          44           47
      93+    −  7.68  − 0.68    0.246         0.158          40           43
      88+    − 12.68  − 1.12    0.133         0.113          28           27
      83+    − 17.68  − 1.56    0.059         0.074          18           18
      78+    − 22.68  − 2.00    0.023         0.036           9           10
      73+    − 27.68  − 2.44    0.0073        0.0157          4            3
      68+    − 32.68  − 2.88    0.00195       0.0073          2            0

                                              1.00000       250          250

[Figure: vertical axis labelled "Proportionate frequencies".]
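The expected frequency for any class in Ex. 20 can also be computed rather than read from the graph. The sketch below follows the rule stated in the example (proportion at or beyond the lower limit, minus the proportion at or beyond the lower limit of the next class); the function names are assumptions chosen for illustration.

```python
import math

def upper_tail(z):
    """Proportion of a normal distribution lying z S.D.s or more above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def expected_in_class(lower, next_lower, mean, sd, n):
    """Expected frequency of scores from `lower` up to (not including) `next_lower`."""
    p = upper_tail((lower - mean) / sd) - upper_tail((next_lower - mean) / sd)
    return p * n

mean, sd, n = 100.68, 11.36, 250
expected_103 = expected_in_class(103, 108, mean, sd, n)

# Interior classes 73-77 up to 128-132, listed by their lower limits.
for lower in range(128, 68, -5):
    print(lower, round(expected_in_class(lower, lower + 5, mean, sd, n), 1))
```

The computed values differ a little from the tabulated ones, because the table rounds x/σ to two places before the graph is read.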


On examining Fig. 18, we see that the proportions whose scores 
deviate by 2.5σ or more are very small, less than 1 in 100, and that
those who deviate by 3σ or more only number about 1 in 1,000.
Here, then, we have the reason for a statement made earlier, namely
that the S.D. usually amounts to one-fifth or one-sixth of the total
range of a distribution. For when N is about 100, and the distribution
is a regular one, we shall ordinarily find the measures ranging from
+ 2.5σ to − 2.5σ, i.e. a total of 5σ. And when N approaches 1,000
we may find scores running from + 3σ to − 3σ, a total of 6σ.

There is yet another way of interpreting normal distributions, 
which will be useful to us later, that is in terms of probabilities. 
Since 0.159 of persons may be expected to obtain measures 1σ or
more above the mean, the probability that any one person will
measure this much or more is 0.159. The probability that he will
measure less is 0.841. In other words, the odds are 841 to 159, or
about 5¼ to 1 against his measure being as high as or higher than
this. Similarly the probability of a measure as high as + 3σ or more
is 0.00135, and the chances against it are 740 to 1. This means, for
example, that we should probably find only one child in an ordinary
elementary school of about 740 pupils with an I.Q. of 145 or more,
if we assume the I.Q.s of school children to be normally distributed
with a σ of 15, since 145 = 100 + 3 × 15. But as many tests have
considerably larger S.D.s, they will yield a greater number of high
I.Q.s, and the probability of coming across such children will be
increased. On the other hand the range of intelligence in any one
primary school is often limited, and this would make the odds
against high I.Q.s much stronger. Thus if σ is 12½, then 150 is 4σ
above the mean, and we can deduce from Fig. 18 that only about 1
in 31,000 (in schools with an average I.Q. 100) would reach this level.


CHAPTER IV 


INTERPRETATION OF SCHOOL MARKS AND 
TEST SCORES 


Combining or Comparing Distributions of Marks.—In Chap. II we 
found that the marks awarded to school work or examinations are 
often so erratically distributed that it is impossible to interpret them, 
as they stand, correctly. The first stages in the interpretation of 
marks involve combining and comparing them. For instance, the 
marks for different questions in one examination, or different exami- 
nations in one subject, need to be combined to give a total score. 
Often the marks on several subjects are combined to yield a general 
view of the pupil's or student's all-round ability. Comparison enters 
when we wish to know whether a pupil or student is better on one 
subject than on some other subject (subjects which may have been 
marked either by the same, or by different, examiners or teachers). 
In large-scale examinations where several examiners each deal with a 
proportion of the scripts it is, of course, essential that their markings 
should be comparable. Sometimes, also, age allowances must be
made; the marks of younger and older pupils need to be rendered 
comparable by the elimination of differences due to age. 

Now, none of these processes can be carried out directly unless 
the marks to be combined or compared occur in closely similar 
frequency distributions. This is a fundamental statistical principle, 
very elementary in nature, yet very widely neglected. But with the 
aid of the statistical tools described in Chap. III, we are now in a 
position to determine whether such distributions are sufficiently 
similar, and if they are not, to adjust them in such a way as to make 
them so. These tools will further assist in interpretation, since they 
provide a scientific definition of those exceedingly vague conceptions 
—the goodness or badness of a mark or test score, and the ease or 
difficulty of an examination question or test item. 

Distributions may be dissimilar from one another in three respects : 
(1) differences in their averages arise, for example, when one exam- 
iner is far more lenient than another; (2) differences in their
dispersions may arise when one examiner spreads out his marks far more
widely above and below the average than another; (3) differences
occur in their shapes, e.g. normal, skew, rectangular, etc. For the 
time being we will consider only the most important shape, the 
approximately normal. We may then state that marks can be directly 
combined or compared only if drawn from distributions with the 
same mean and the same S.D. 

The Importance of Equivalent Standard Deviations.—Identity of
S.D.s is the more important of these two requirements, though it is 
less often recognized. In many instances of combination of marks, 
the averages do not matter in the slightest. 

Ex. 21.—Suppose that the final marks of a group of pupils or
students depend on their results in three subjects, A, B, and C, and
that the distributions are as follows:

    Subject    Mean    Range     S.D.
       A        60     40-80      7
       B        50     30-70      7
       C        80     60-100     7


Now, a pupil who is average on all three subjects obtains a total 
of 190/300, and another who is average on two subjects but top on 
the third scores 210/300 whatever the subject. 


Ex. 22.—In contrast to Ex. 21, let us now combine the following 
sets of marks, whose S.D.s differ. 


    Subject    Mean    Range     S.D.
       A        60     40-80      7
       B        50     10-90     14
       C        80     70-90      3½


A pupil who is average all round again scores 190. But now a 
pupil who is average in two subjects and top in a third obtains 210, 
230, and 200 according as the third subject is A, B, or C respectively. 
The fact of doing well in subject B (which has the lowest average) 
actually raises the score the most because B has the biggest S.D. 
And doing well in C has the least effect in raising the final total 
(although it has the highest average), because it has the smallest S.D. 
Similarly if we compare the performances of three pupils who are 
at the lower quartile on one subject but average on the other two, 
we find their respective marks to be 185, 181, and 188. Doing badly 
on B has a big effect, but doing badly on C has little effect. 
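The totals quoted in Ex. 22 follow from simple addition, as the short sketch below shows. The dictionary layout and names are assumptions made for illustration; the means and ranges are those tabulated in the example.

```python
# Ex. 22: (mean, lowest mark, highest mark) for each subject.
subjects = {"A": (60, 40, 80), "B": (50, 10, 90), "C": (80, 70, 90)}

# A pupil who is average on every subject.
average_total = sum(mean for mean, _, _ in subjects.values())

# A pupil who is average on two subjects and top on the third.
top_totals = {
    name: average_total - mean + highest
    for name, (mean, _, highest) in subjects.items()
}
print(average_total, top_totals)
```

The gain over the average total is 20, 40, and 10 for A, B, and C, in the same ratio as their S.D.s of 7, 14, and 3½.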


The general principle, illustrated by these examples, is that when 
several sets of marks are combined, their relative 'weights,' i.e. their
respective degrees of influence upon the final total, are proportional 
not to their averages but to their S.D.s. If in a final mark we wish to 
give more weight to one subject than to another, e.g. twice as much 
weight to written as to mental arithmetic, we shall accomplish it,
not by making the average mark on the former twice the average on 
the latter, but by making the range and S.D. of the former twice as 
large. The usual procedure in such weighting is to multiply the for- 
mer set of marks by two. This may produce the desired effect, not 
because it doubles the average, but because it doubles the S.D. The 
same effect would follow if the average remained unchanged while 
the deviation of each pupil's mark from this average was doubled. 
When the marks derived from several questions in a single examina- 
tion are to be combined in certain proportions, this same principle of 
adjusting the S.D.s for the questions must be borne in mind. 

In order to combine or compare two sets of marks whose distribu- 
tions are dissimilar, we should then express each set in the form of 
deviations from their own means, and multiply one set by a figure 
which will raise or lower its S.D. to the same level as that of the other 
set. For instance, in the two sets B and C in Ex. 22, the ratio of S.D.s
is 14 to 3½, or 4 to 1. The deviations in C will therefore become
equivalent to those in B if multiplied by 4. Alternatively, each mark
may be expressed in x/σ form. The top marks on A, B, and C thus become

$$\frac{80 - 60}{7}, \qquad \frac{90 - 50}{14} \qquad \text{and} \qquad \frac{90 - 80}{3\tfrac{1}{2}}.$$

All of these equal + 2.86. When so converted, they are all equivalent.

Scores which are expressed in this form are referred to as standard 
scores, or σ scores, or sometimes z-scores. As we shall see below,
they are frequently adopted in scoring mental tests. For example, the 
vocational tester may wish to know whether a man's performance 
on a mechanical test is relatively higher or lower than on a clerical or 
some other test. The tests are likely to have different means and S.D.s, 
but the man's standard scores can legitimately be compared. 
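The conversion to standard scores is a one-line calculation. The sketch below applies it to the top marks of the three subjects in Ex. 22 (S.D.s of 7, 14, and 3.5); the function name is chosen for illustration only.

```python
def standard_score(mark, mean, sd):
    """Express a mark as a deviation from its own mean in units of its own S.D."""
    return (mark - mean) / sd

top_a = standard_score(80, 60, 7)
top_b = standard_score(90, 50, 14)
top_c = standard_score(90, 80, 3.5)
print(round(top_a, 2), round(top_b, 2), round(top_c, 2))

# Equivalent adjustment of raw deviations: C's deviations multiplied by 14/3.5 = 4.
rescaled_c = (90 - 80) * (14 / 3.5)
```

After conversion the three top marks coincide, which is what makes the marks legitimately comparable.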

The Importance of Equivalent Means.—Naturally the averages of 
distributions are important under some circumstances. If marks on 
several questions or several subjects are to be combined, and alterna- 
tives are allowed, so that each pupil's final mark is derived from a 
selected number of questions or subjects, then it is quite essential to 
ensure equivalent means. Thus in Exs. 21 and 22, if pupils took either 
A and B, or A and C, those who were at the average level would total
110 on the former combination, 140 on the latter. Again, when scores 
are to be, not combined, but compared, it is obvious that differences 
in means will entirely upset the comparisons. The adjustment of
discrepancies in means is easy: merely add or subtract so much from
all the marks in one subject or question until its average is the same 
as that for the other subject or question. More often than not, how- 
ever, discrepancies in means are accompanied by discrepancies in 
S.D.s. In that case resort must be made to the technique already
described, according to which each mark is expressed as x/σ, or to some
modification of this technique (cf. pp. 59-62).


EQUATING OF MARKS BY THE PERCENTILE 
TECHNIQUE 


Although the statistical treatment of distributions in terms of 
means and S.D.s is far from difficult, it may admittedly lie outside 
the scope of the ordinary teacher or examiner. Further, it cannot be 
used when the distributions are clearly non-normal. For instance, if a 
headmaster, to whom were submitted the marks listed in Ex. 5 (p. 29) 
wished to compare the pupils’ standings on reading and geography, 
he could not use this technique. The alternative, percentile, system 
would, however, be quite suitable for making such a comparison, 
and it is probably simpler for the uninitiated to comprehend. (It does 
not even involve looking up squares and square roots, as does a
technique based on the S.D.) 

Percentiles will provide a uniform scale in terms of which compari- 
sons may be made between any sets of marks. Whatever the test or 
examination which a group of pupils or students takes, a certain 
percentile level always means the same level of achievement (relative 
to that group of people). For instance, on looking at Ex. 5 we might 
be inclined to suppose that 8 marks for reading represents a better 
achievement than 3 marks for geography; certainly the pupils them- 
selves would infer that this was so. But actually they are precisely 
equivalent, since both marks fall at the 25th percentile, Q₁. The
teacher may, of course, assert that the general standard of reading in 
that class is good and the geography poor. This is an objection which 
will frequently be raised by critics of this chapter, and we will show
later why we believe it to be baseless.

Percentile Graphs.—One of the best ways of comparing two sets of
marks is by a 7-point percentile graph, which is illustrated in the
following example.
                   Frequencies
    Mark       English    Arithmetic
    90 +          ?            3
    80 +         10           11
    70 +          ?           13
    60 +         17           15
    50 +          8           10
    40 +          ?            2
    30 +          1            0
    20 +          0            1

                 57           57


Ex. 23.—The table shows the marks of 57 pupils, aged 12+, in
a Scottish school on final examinations in English and arithmetic.
Though the tabulation is coarse, it indicates that the two
distributions differ in their means and S.D.s, and that, while the first is
roughly normal, the second is negatively skewed. The procedure is,
first, to determine the 99th, 90th, 75th, 50th, 25th, 10th, and 1st
percentiles for each set of marks. From the original lists these are
found to be 90, 84, 77, 67, 59, 49, 39, and 97, 89, 81, 71, 60, 53, 35,
respectively. Draw a graph, as shown in Fig. 19, where English
marks are plotted along the Y axis against arithmetic marks on the
X axis. Plot these seven points (X = 97, Y = 90; X = 89, Y = 84;
etc.), and connect them up by a straight line or smooth curve which
runs as nearly as possible through them all. This line may be
extended at both ends so as to cover the total ranges of marks. From
it we can now read off the mark on the English scale which
corresponds to each mark on the arithmetic scale, and vice versa. Thus
41 in English is equivalent to 39 in arithmetic, 80 in English to 85
in arithmetic.

[Fig. 19.—Comparison of two sets of school marks by means of a seven-point
graph. Vertical axis: English %; horizontal axis: Arithmetic %.]


Often it is not necessary to determine as many points as seven in 
order to fix the graph. Three points, preferably the 90th, 50th, and 
10th percentiles, may give sufficient accuracy. Note that just the 
same procedure could have been used even if the distributions had 
differed widely in range and dispersion, e.g. if one subject had been 
marked out of 40, the other out of 200.¹
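The reading-off of equivalent marks from the percentile graph can be imitated by interpolation between the plotted points. The sketch below joins the seven points of Ex. 23 by straight segments, which is an assumption of this example (the text allows a single straight line or a smooth curve), and extends the end segments to cover the full range of marks. All names are hypothetical.

```python
# Seven percentile points of Ex. 23, from the 1st up to the 99th percentile.
arithmetic = [35, 53, 60, 71, 81, 89, 97]
english = [39, 49, 59, 67, 77, 84, 90]

def equate(mark, from_points, to_points):
    """Piecewise-linear reading of the percentile graph (end segments extended)."""
    pairs = list(zip(from_points, to_points))
    # Find the segment containing the mark; marks beyond the last point
    # fall through the loop and use the final segment.
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if mark <= x1:
            break
    return y0 + (mark - x0) * (y1 - y0) / (x1 - x0)

english_for_85 = equate(85, arithmetic, english)
english_for_39 = equate(39, arithmetic, english)
print(round(english_for_85, 1), round(english_for_39, 1))
```

Both readings fall within about a mark of the equivalents quoted in Ex. 23, the remaining difference being due to the smoothing of the drawn line.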

Interpretation of Marks by Means of Percentile Levels.—The
results of many standardized tests of intelligence and vocational 
aptitudes are quoted in terms of percentiles, since the original or 
raw scores have little meaning in themselves ; for example—percen- 
tiles relative to 13-year modern pupils, or percentiles relative to R.A.F. 
recruits, and so on. It has sometimes been suggested that the same 
system should be used for school or university examinations. The 
original marks would not be published at all. Average achievement 
would always be represented by 50, and very high or low achievement 
by marks approaching 100 and 0. Though this would be fairer in some 
ways, it would be difficult for pupils, students, or parents to grasp 
that these were always relative to the group which had taken that 
examination. Percentiles are, in fact, precisely equivalent to rank 
orders, except that they always run from 100 to 0 instead of from 1 
to N. The scheme would also not be feasible for examinations taken 
by small numbers, say less than twenty students (cf. p. 66). Never- 
theless it would be a considerable step forward if the mark lists for 
large-scale examinations, at universities, training centres or second- 
ary schools, included a table of the median, quartiles and some of 
the other percentiles for the marks which had actually been awarded. 
Teachers and students would soon learn the significance of these 
terms, and would then be able to see what standards of marking the 
various examiners were adopting. They could also obtain a much 
better idea of relative marks in different subjects than from the
present supposedly absolute marks, which actually vary widely from
one examination to another.


¹ Further discussion of this topic may be found in McIntosh et al. (1949).
Defects of Percentiles.—Although percentiles are invaluable for
purposes of comparing distributions, they may be very misleading if 
used for combining distributions. The reason is that percentile units 
are very far from equivalent to one another, since they are far closer 
together at or near the median than they are at or near the extremes. 
It can be seen from the probability ‘integral’ graph (Fig. 18) that a 
percentile level of 90, i.e. the measure above which only 10% of the
measures lie, corresponds to an x/σ or standard score of + 1.28. The
80th similarly corresponds to + 0.84, the 60th and 50th to + 0.25
and 0.00. Thus the difference between the first two of these, 0.44, is
decidedly greater than that between the second pair, 0.25. Actually
the 99th, 94th, 78th, 50th, 22nd, 6th, and 1st percentiles are all
approximately equally spaced, in the sense that a child who is at the 99th
percentile is as superior to one at the 94th, as the 94th is to the 78th, 
or the 78th to the 50th. The consequence of this inequality of units 
is that, although we may, if we wish, average or otherwise combine 
a person's percentiles on different tests, we shall obtain thereby quite 
different results from those obtained by averaging his scores. 

Ex. 24.— Pupil A's marks on two examinations lie at the 98th and 
54th percentiles; Pupil B's marks on both examinations lie at the 
85th percentile. A's average percentile of 76 is distinctly poorer than
B's 85. But if these figures are converted into x/σ scores, i.e. into units
which really are equivalent, it will be found that A is actually slightly
better than B. For the x/σ values of the 98th and 54th percentiles are
+ 2.054 and + 0.100, average + 1.077; whereas the x/σ value of the
85th percentile is only + 1.036.
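The comparison in Ex. 24 can be reproduced by converting each percentile to its x/σ value with the inverse of the normal curve. The sketch assumes Python 3.8 or later (for `statistics.NormalDist`); the helper name is an assumption of this example.

```python
from statistics import NormalDist

def percentile_to_standard_score(percentile):
    """x/sigma value below which the given percentage of a normal distribution lies."""
    return NormalDist().inv_cdf(percentile / 100)

# Pupil A: 98th and 54th percentiles; Pupil B: 85th percentile on both.
a_mean_percentile = (98 + 54) / 2
a_mean_score = (percentile_to_standard_score(98)
                + percentile_to_standard_score(54)) / 2
b_score = percentile_to_standard_score(85)

print(a_mean_percentile, round(a_mean_score, 3), round(b_score, 3))
```

Averaging percentiles places A below B, while averaging standard scores places A slightly above, which is the inconsistency the example is meant to expose.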


It is advisable, therefore, never to use percentiles for combining 
marks. The same is true of rank orders (cf. p. 25-6). 

Adoption of a Standard Scale for All School or Examination Marks.—
The following solution to this difficulty may be recommended.
Each chief examiner, headmaster, or other authority who is concerned
with combining, comparing, and interpreting sets of marks
awarded by sub-examiners, class teachers, etc., should adopt an
arbitrary standard scale. He might, for instance, choose a percentage
scale ranging roughly from 30% to 90%, with a mean of 60%
and a S.D. of 10. Now, on such a scale the 7 percentile points
mentioned above fall at 83, 73, 67, 60, 53, 47, and 37 respectively. Each
set of marks submitted by a teacher or examiner should be converted
into this scale by the graphical method already described. If this is
done, all the marks from all the markers will be strictly comparable
with one another, and they will be expressed in units which are
(sufficiently closely for all practical purposes) equivalent to one
another, so that they can be combined or averaged at will. And there is
the further advantage that the scale corresponds well to the present
conventional conception of high and low percentages. Naturally any
other standard scale can be chosen if preferred, and its percentile
points determined from the probability integral graph. Fig. 20 shows
the graphs for transforming the two distributions of Ex. 23 into the
suggested scale. 97 for arithmetic and 90 for English are plotted
against 83 (the 99th percentile point) on the standard scale, and so
on. The standard mark for any other mark can be read off. Thus 81%
in arithmetic becomes 67% standard, and 81% in English becomes
71% standard.

[Fig. 20.—Conversion of two sets of school marks into a standard scale.
Axes: Standard Scale % against Arithmetic %.]

The x/σ values of the 99th and 1st percentiles are + and − 2.326 respectively,
of the 90th and 10th they are + and − 1.282, of the 75th and 25th they are +
and − 0.675.

Another way of achieving the same result is as follows. By apply- 
ing the method described in Ex. 20 (p. 50), a table may be drawn up 
showing the numbers of cases to be expected within each class- 
interval on the desired standard scale. Thus for a scale with a mean 
of 60 and S.D. 10, the following percentages of pupils should fall 
within successive class-intervals of 5 marks. 


Ex. 25.—

    Mark      F%
    88-92     0 or ½
    83-87     1
    78-82     3
    73-77     6½
    68-72     12
    63-67     17
    58-62     20
    53-57     17
    48-52     12
    43-47     6½
    38-42     3
    33-37     1
    28-32     0 or ½

    Total     100


Teachers or sub-examiners may be given such a table and in- 
structed to award marks to their pupils roughly in accordance with 
the frequencies there shown (precise conformity is not necessary). 
Or, alternatively, the chief examiner or head teacher may take each of 
the distributions of unscaled marks, award a scaled mark between 
88 and 92 to the top 0 or ½% of the examinees, marks between 83 and
87 to the next 1% of them, and so on. Only very exceptionally good
or poor candidates, those better or worse than 999 in 1,000 of the
ordinary run, should be given marks of over 90% or under 30%.
For in a normal distribution only a proportion of 0.135% is expected
to score more than + 3σ or less than − 3σ.

Ex. 26.—

                  Frequency
    Mark        (a)      (b)
    16-20       Exceptional cases only
    15            1
    14            3        1
    13            6        3
    12           12        5
    11           18        7
    10           20        8
     9           18        7
     8           12        5
     7            6        3
     6            3        1
     5            1
    0-4         Exceptional cases only

    Total       100       40


Burt (1917) recommends an alternative standard scale of marks,
based on a similar plan, but with an average of 10 and a maximum
range of 0-20. The scale is given here with the frequencies to be
expected when (a) 100 and (b) 40 pupils are marked according to it.
Here also exact adherence should not be demanded. Any one set of
marks for 40 pupils may depart from it fairly widely. But in the long
run all the marks awarded by any one teacher or examiner should
show the same mean, S.D., and tendency to normality as does this
standard scale.
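
The frequencies tabulated in Ex. 25 and Ex. 26 can be approximated from the normal curve. The sketch below is only an illustration: the function name `expected_frequencies` is hypothetical, marks are treated as whole numbers (so that the interval 58-62 runs from 57.5 to 62.5), and the S.D. of 2 used for the 0-20 scale is an assumption inferred from the tabulated frequencies rather than a figure stated in the text.

```python
from statistics import NormalDist


def expected_frequencies(mean, sd, lowest_marks, width=1, pupils=100):
    """Expected number of pupils in each class-interval of a normal scale.

    `lowest_marks` gives the lowest mark of every interval. Marks are whole
    numbers, so an interval such as 58-62 is taken to span 57.5 to 62.5.
    """
    dist = NormalDist(mean, sd)
    result = {}
    for low in lowest_marks:
        high = low + width - 1
        share = dist.cdf(high + 0.5) - dist.cdf(low - 0.5)
        result[(low, high)] = pupils * share
    return result


# Percentage scale of Ex. 25: mean 60, S.D. 10, class-intervals of 5 marks.
percent_scale = expected_frequencies(60, 10, range(28, 93, 5), width=5)

# Scale of Ex. 26 (a): mean 10, single marks, S.D. of 2 ASSUMED for illustration.
twenty_scale = expected_frequencies(10, 2, range(5, 16), pupils=100)

# Proportion expected beyond 3 S.D.s on one side of the mean, in per cent.
beyond_three_sd = 100 * (1 - NormalDist().cdf(3))

for (low, high), freq in sorted(percent_scale.items(), reverse=True):
    print(f"{low}-{high}: {freq:5.1f}")
print(f"Beyond + 3 S.D.: {beyond_three_sd:.3f}%")
```

Rounding the printed values to the nearest half gives figures close to those tabulated, which is all the precision the method requires.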


OBJECTIONS TO RATIONALIZED MARKING 

There is no doubt that many teachers and examiners will object 
strongly to the principle upon which this chapter is based, that the 
average and the dispersion of sets of marks should always be the 
same in any one institution. Faculties in a university which consist- 
ently award high marks sometimes try to justify themselves by 
claiming that they get a ‘better run’ of students than do those which
consistently award low marks. Examiners regard it as only natural to 
award different average marks to the answers to different questions
in one paper, since some questions are ‘done better’ than others.
Teachers similarly regard their classes as ‘better at’ some subjects 
than at others, and so deserving of higher marks. Again, the class 
they have this year may seem to them an exceptionally ‘good’ or 
‘bad’ one, and the compositions written by the class may be ‘better 
or poorer than usual.’ Finally, those mathematical lecturers and 
teachers who are particularly prone to award very widely dispersed 
marks state that their students exhibit a ‘wider range of accomplish- 
ment’ than do students of other subjects. 

Justification for Equalizing the Average in All Sets of Marks.— 
Many of these claims are extremely dubious. Thus it is more likely 
that the university departments which award high marks will attract 
a poorer, rather than a better, run of students, and that the subjects 
which are marked more strictly will be pursued chiefly by the better 
students who care more for the subjects than for marks. Again, one 
is justified in suspecting teachers’ judgments about the merits of their 
classes, since some teachers appear, year after year, to be blessed 
with good classes, and others to be cursed with bad ones. According 
to what is popularly known as ‘the law of averages,’ one would 
expect the number of good pupils to be balanced in the long run by 
an equal number of bad ones. As a matter of fact, the statistical 
techniques outlined in Chap. V give us a method of predicting just 
how much classes are likely to vary from year to year, or English 
compositions from week to week, owing to so-called fluctuations of 
sampling. Here is one instance. A class of 45 pupils averages 60% in 
their general scholastic work on a scale of marks running roughly 
from 30% to 90%. From these data it may be stated that there is an 
even chance that the average level of next year’s class will lie between 
59% and 61%, and that it is extremely improbable that next year’s 
will differ from this year’s by more than 3% or 4% up or down. 
Consider also the marking of English compositions, or of half a 
dozen or more separate questions in an examination paper. If the 
scale of marks employed ranges roughly from 4 to 20 (S.D. 3), then 
we can predict that the maximum variation likely to occur in the 
average marks for different compositions or questions is 3 marks, 
provided that there are at least 36 scripts, and only 2 marks if there 
are 81 scripts. An individual pupil may certainly range very widely in 
his level of performance on different questions, but the average level 
of a group of pupils is far more stable than is commonly supposed. 
The apparent variations in the goodness of answers are due much
less to variations in the pupils than to irregularities in the suitability
of the questions or of the topics for compositions. But if an examiner 
sets a bad question which is beyond the scope of the majority of 
examinees, or which does not allow them to express their fullest 
capacities, there is no justification for giving low marks. It is there- 
fore much fairer to give the same average mark to the actual average 
script in all cases, as was suggested in Chap. II. 
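
The figures quoted for fluctuations of sampling can be reproduced approximately as follows. This is a sketch under stated assumptions: the class marks are taken to have a S.D. of 10 (the value proposed for the 30% to 90% scale), the ‘even chance’ band is read as the probable error of the mean, and the ‘maximum variation likely to occur’ is read as a band of three standard errors on either side. The function names are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean of n marks whose S.D. is sd."""
    return sd / sqrt(n)


def probable_error(sd: float, n: int) -> float:
    """Half-width of the band within which the mean falls with an even chance."""
    return NormalDist().inv_cdf(0.75) * standard_error(sd, n)


def likely_spread(sd: float, n: int, k: float = 3) -> float:
    """Total width of the band k standard errors either side (k = 3 is assumed)."""
    return 2 * k * standard_error(sd, n)


# Class of 45 pupils averaging 60%, S.D. assumed to be 10.
class_pe = probable_error(sd=10, n=45)
print(f"Even chance that next year's average lies within 60 +/- {class_pe:.1f}")
print(f"Three standard errors: +/- {3 * standard_error(10, 45):.1f}")

# Compositions or questions marked on a scale with S.D. 3.
for scripts in (36, 81):
    print(f"{scripts} scripts: spread of about {likely_spread(3, scripts):.0f} marks")
```

The standard error shrinks with the square root of the number of scripts, which is why the average of a group is so much more stable than the mark of any one pupil.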

It is a different matter when a group of pupils or students is known, 
from objective evidence, to be selected in such a way as to be, on the 
average, superior or inferior to the mean of some larger group. A
special class for scholarship candidates and one for backward pupils
will naturally show generally good and poor attainments respectively.
But in many cases the teacher or examiner who claims to have a
specially good or poor set of pupils or scripts is judging purely on 
the basis of subjective recollections of what he regards as average 
level. When a test or examination is applied to all the pupils of a 
certain age in a school, or in several schools, then the whole group 
should be marked according to the principles already outlined, and 
the marks for separate classes within this wider group may be picked 
out and averaged. Such averages will then quite justifiably show a 
certain range of variation. But when a test or examination is given 
only to a single class, it is far safer not to trust the marker's subjective 
view of the class's general level, but to award an average mark to the 
average script actually obtained from this class. Similarly the aver- 
ages for different school subjects should be the same, regardless of 
whether a class appears to be better at one subject than at another, 
except when objective evidence (derived from the application of the 
same tests or examinations to other, larger, groups of pupils) proves 
that the class's attainments are uneven. 

The ‘Goodness’ or ‘Badness’ of a Mark.—We still have to face the
wider question: what do teachers and examiners mean when they say 
that a pupil's work is good or bad, or an examination paper easy or 
difficult? For endless objections of the type discussed above will be 
raised until this is settled. Most persons who have marking to do 
obviously believe that they possess in their minds some absolute 
standard of goodness and badness by means of which they can 
evaluate a pupil’s or a class's achievement. But numerous experi- 
ments have proved that such subjective judgments of merit or diffi- 
culty, when obtained from several different persons, are often widely 
discrepant. It was for this reason that we were forced to conclude, in
Chap. II, that all marking consists primarily in ranking a set of
performances in order, and that it is a matter of relative comparison, 
not of absolute placement. The goodness of a child's performance at 
some task can therefore mean only that he ranks high when his 
performance is directly compared with the performances of other 
children at the same task, under similar circumstances. An achieve- 
ment is good if only very few people can do as well or better, and it 
is poor if only very few people do it as badly or worse. Similarly, the
rational definition of the difficulty of a task is the fewness of people
who can accomplish this task.

‘Fewness of people,’ in statistical terminology, connotes either
percentile level or, preferably, x/σ value. A mark is not necessarily a
good one because it happens to be 80% or 9/10 and the like. But it is
good relative to the marks of some particular group of persons if it
falls at the 95th percentile, or is 2σ above the mean mark obtained
by these persons, or if it stands high on a standard scale whose mean
and S.D. are known. Such designations of goodness or badness are
entirely unequivocal.

The Ease or Difficulty of Questions.—By approaching marking
from this standpoint we obtain a rational scale, not only of the goodness
or badness of pupils’ work, but also of the ease or difficulty of
questions or test items. If, for example, five items are answered
correctly by 16%, 22½%, 31%, 40%, and 50% of a large group of
persons, then we can state that (for these persons) the items are all
equally spaced in difficulty. For these proportions correspond to x/σ
values of + 1.0, + 0.75, + 0.50, + 0.25, and 0.0 respectively.
Similarly some questions which are below average difficulty will be
equally spaced if answered by, say, 91%, 94%, and 96% of pupils,
for here the proportions of incorrect answers, namely 9%, 6%, and
4%, correspond to x/σ values of − 1.35, − 1.55, and − 1.75 respectively.
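
The conversion from the proportion answering an item correctly to its position on the x/σ scale can be sketched in Python. The function name `item_difficulty` is hypothetical, and the sketch assumes that the ability underlying the item is normally distributed in the group tested.

```python
from statistics import NormalDist


def item_difficulty(percent_correct: float) -> float:
    """x/sigma position of an item: positive means harder than average,
    negative means easier than average."""
    return NormalDist().inv_cdf(1 - percent_correct / 100)


hard_items = [16, 22.5, 31, 40, 50]   # % answering correctly
easy_items = [91, 94, 96]

hard_z = [item_difficulty(p) for p in hard_items]
easy_z = [item_difficulty(p) for p in easy_items]

for p, z in zip(hard_items + easy_items, hard_z + easy_z):
    print(f"{p:5.1f}% correct -> x/sigma = {z:+.2f}")
```

Successive items in each list differ by roughly the same amount on the x/σ scale, although the percentages themselves are far from evenly spaced.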

Justification for Equalizing the Dispersion in all sets of Marks.—
The claim that mathematical accomplishments vary more widely
than do accomplishments in other subjects requires some further
consideration. It is true that it is usually easier to discriminate
between different levels of mathematical than of, say, English
composition, ability, because tests in the former can generally be marked
more objectively and accurately. Thus in the work of an ordinary
school class it may be impossible to distinguish more than about ten 
different levels of essay writing, whereas perhaps a hundred levels of 
mathematical skill may be distinguished. Although there is some 
justification for attaching greater weight to marks obtained by the 
more efficient measuring instrument, yet this greater efficiency is no 
criterion of the range of ability, as the following analogy shows. In
a school quarter-mile race, one referee may time the boys with a
seconds-hand watch, and find that some take 50, others 51, 52, ...
60 seconds. Another referee may possess a tenth-of-a-second stop-watch
and find that some boys take 50.0, others 50.1, 50.2 ... 59.9,
60.0 seconds. But the range of ability is the same in both cases.

It is only fair to mention that some authorities such as G. H. 
Thomson (1939) have suggested that greater variability would be 
found in mathematical than in other abilities, if it were possible to 
measure variability absolutely. But there seems to be no satisfactory 
or convincing way of achieving such absolute measurement in 
psychology. Perhaps it is wisest, therefore, to follow the convention 
that mental testers have adopted of assuming the same dispersion 
for all abilities in representative groups of all ages. 

The Marking of Small Groups.—Peculiar difficulties still remain in 
applying the principles and techniques of this chapter to the marking 
of small groups. The mean and the S.D. of marks among, say, half 
a dozen Honours students may vary far more from year to year than 
they do among 40 or 100 students. If the average is taken as 60% 
this year, the odds are even that it will lie somewhere between 57 and 
63 next year, but it may drop as low as about 50%, or rise as high as
70%. And if there is only one candidate who is awarded 60% this
year, a single candidate next year may deserve anywhere from 30%
to 90%. Until such time as objective tests are available, standardized
on Honours students from many universities, or over several years,
the examiner can only employ subjective marking scales, as indicated
in Chap. II (p. 35).


TEST NORMS 
Test norms are tables of the scores of large groups of children or 
adults by reference to which it is possible to interpret the significance 
of the score obtained by any person, or group of persons, to whom 
the test is applied. The mental tester fully admits that a test
performance, as such, tells us nothing as to its goodness or badness, and
that it must be compared with the performances of other similar
persons tested under the same conditions. Thus some tests are
standardized, and norms are obtained, by being applied to large groups
of typical 11-year junior school children; others are given to
school-leavers aged 14 to 15, or to grammar school pupils; others to adult
Army recruits, or to university students, and so forth. American 
testers often publish grade norms, i.e. the standards obtained by 
successive school classes in their elementary and high schools. Here 
too a number of schools representative of all levels must be chosen, 
on account of the considerable variations in standards between 
different schools. 

A mere statement of the average or median scores of such groups
is, of course, insufficient. Often the deciles are quoted, so that it is
possible to state that a person's performance is, say, as good as that
of the best 20% of the group upon which the test was standardized.
A system based on x/σ values, or standard scores, is preferable since,
as shown above, percentile units vary widely in magnitude at different
parts of the scale. A system which is becoming widely adopted is
first to determine the percentiles and then to convert these into
standard scores with an arbitrary S.D. (usually 10, 15 or 20) and an
arbitrary mean (usually 50 or 100). By this device the converted
scores may be regarded as an equal-unit scale, even though the
original distribution of raw scores was markedly non-normal. Thus
with a S.D. of 15 and a mean of 100, whatever raw score falls at the 
84th percentile is converted to 115. The range of standard scores on 
this scale would be from about 145 to 55. 
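
The two-step conversion described above (raw score to percentile, percentile to standard score) might be sketched as follows. The raw scores are hypothetical, the function names are introduced for illustration, and the use of mid-rank percentiles for tied scores is an assumption of this sketch rather than a procedure laid down in the text.

```python
from statistics import NormalDist


def percentile_ranks(raw_scores):
    """Mid-rank percentile (between 0 and 100) of each raw score in its group."""
    n = len(raw_scores)
    ranks = []
    for score in raw_scores:
        below = sum(1 for other in raw_scores if other < score)
        tied = sum(1 for other in raw_scores if other == score)
        ranks.append(100 * (below + 0.5 * tied) / n)
    return ranks


def normalized_standard_scores(raw_scores, mean=100, sd=15):
    """Convert raw scores to standard scores through their percentiles."""
    unit = NormalDist()
    return [mean + sd * unit.inv_cdf(p / 100) for p in percentile_ranks(raw_scores)]


# Hypothetical, markedly skewed raw scores from a small standardization group.
raw = [3, 4, 4, 5, 6, 8, 11, 15, 22, 40]
scaled = normalized_standard_scores(raw)

# Whatever raw score falls at the 84th percentile is converted to about 115.
at_84th = 100 + 15 * NormalDist().inv_cdf(0.84)

for r, s in zip(raw, scaled):
    print(f"raw {r:2d} -> standard score {s:5.1f}")
print(f"84th percentile -> {at_84th:.1f}")
```

The order of the testees is unchanged by the conversion; only the spacing between them is altered, so that the skewness of the raw scores disappears.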

It should by now be obvious that the usefulness of such norms is 
strictly limited by the nature of the standardization group. Norms 
are purely and simply a summary of the test performances of a 
particular group of testees, and must be interpreted with reference to 
that group. For example, if an American educational test is given 
to a British child and his score falls at the 70th percentile according
to the norms published with the test, this means that he did better 
than 70% of the American children who took the test, but does not 
necessarily show that he is above the British median or average. In 
point of fact American educational tests are almost useless in this 
country because their standards differ so greatly from ours (cf. 
McGregor, 1934), and, of course, because their content and wording 
are often unsuitable. Probably the same is true of group verbal
intelligence tests, though non-verbal tests and individual scales such
as Terman-Merrill and Wechsler-Bellevue (with minor adaptations)
seem to give comparable results (cf. Chap. IX). Even between England
and Scotland the differences are so big that few tests which
have been standardized in one country can be used in the other,
unless fresh norms have been collected (cf. Vernon, O'Gorman and
McLellan, 1955). The appalling efficiency of Scottish teaching methods
raises the performance of Scottish elementary school children on tests
of the 3 R's by anything from 10% to 50% over the English level. Such
evidence as we possess, however, points to no appreciable differences 
on tests of general intelligence. 

Many tests have been adequately standardized in London or 
Glasgow, but are not appropriate for use in the provinces or in rural 
areas where the educational level, and possibly also the intelligence 
level, may be different. Probably it would be better to provide more 
than one set of norms for such different types of school, so that a 
child’s test performance could be referred to the most suitable set. 
Apparently even the size of school has some effect upon educational 
level (cf. Mowat, 1938). Still more potent is socio-economic status, 
since this is known to have a marked effect upon educational and 
intellectual levels.¹ Thus if a test has been standardized by being
applied only to children in a rather poor area, its norms will not be 
appropriate for schools in a relatively superior area. Adequate 
norms for, say, children aged 11 + would be obtained only if all 
types of schools containing pupils of this age were adequately 
represented in the standardization group. Norms derived only from 
one type of school may still be of some use, but the tester must then 
remember to interpret his testees’ results with reference to this type. 

The origin and the adequacy of many of the published test norms 
are so dubious that we would advise testers in schools to treat them 
all with caution, and when possible to do without them. Once a whole 
school class, or all the children of a certain age in the school, have 
been tested, each child’s standing relative to the rest of this group 
will be known, and will provide most of the information needed for 
educational diagnosis and prognosis, without reference to any outside 
standards. And if the same tests can be applied in a school for
several years, an extremely useful set of ‘local norms’ could quite
easily be collected, either in terms of percentiles or of σ scores.

1 Cf. Burt (1937). The reasons for this effect are probably partly environmental
and partly hereditary. Poor surroundings and upbringing undoubtedly have
many adverse influences upon education. But in addition there is indisputable
evidence of a connection, albeit a small one, between intelligence and social class,
quite apart from environmental influences.


AGE NORMS 

Age norms may be defined as follows. A child who obtains a 
Mental Age or M.A. (on an intelligence test), or an Educational Age
or E.A. (on an educational test) of x years is one whose performance
is as good as that of the average child of Chronological Age or C.A.
of x years. For example, a child aged 10.0 may do as well in an
intelligence test as an average 11.0-year-old, but only score up to the
level of a 9.0-year-old on tests of reading and arithmetic. He is then
said to possess C.A. 10.0, M.A. 11.0, Reading and Arithmetic Ages
of 9.0. At first sight this system of expressing test results is far simpler
than the percentile or x/σ systems. It provides a uniform scale of units
by means of which a child's relative performances on any number of
tests can be compared. His strong or weak points in different school 
subjects, or even in particular aspects of school subjects (e.g. the 
main arithmetical skills), and his superiority or inferiority on verbal 
and non-verbal intelligence tests, etc., can be seen at a glance from 
a table or graph of his various E.A.s and M.A.s. Nevertheless, there 
are many serious weaknesses in this type of norm, which we shall 
have to describe in detail. 

Difficulties in Establishing Age Norms.—First there are almost
insuperable difficulties in collecting standardization groups which
will be truly representative of all the children of a certain age.
Between 6 and 11 years the primary schools will provide groups whose
average scores are probably close to the averages in the total
population. Any one age group will, of course, be scattered through a large
number of different school classes. But the range of ability in primary
schools is restricted, since some of the brightest and dullest sections
of the population attend private and special schools respectively.
And, as we have seen, the range or dispersion of a distribution is
quite as important as its average. Beyond 11 to 12 years, when
primary school pupils proceed to various types of schools, it becomes
increasingly difficult to secure samples whose averages, let alone
ranges, are likely to be typical. In a very few instances results have
been obtained from practically all the children in a given district,
or from all the children born on certain dates. But often the best
that can be done is to control the socio-economic factor. Suppose
that a district contains a hundred schools, and that a tester wishes
to standardize a test on pupils in ten of them; he should see that the
schools he selects cover the whole range of socio-economic level, in
due proportion. An extension of this method was used by Cattell
(1934) in establishing adult norms for an intelligence test. His testees
were classified according to their occupations, and the numbers at
each main occupational level were compared with the corresponding
numbers in the total population. Since he had an unduly high
proportion of professionals, and too low a proportion of labourers, he
reduced the weight of the former and increased that of the latter
group, before calculating the mean and the S.D. of test scores for his
combined groups.

1 Cf. Fraser Roberts et al. (1935).
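
The kind of re-weighting attributed to Cattell can be illustrated with a small Python sketch. All the figures below are hypothetical and are not Cattell's data; the three occupational levels, the population shares, and the function name `weighted_mean_sd` are assumptions made only to show the principle that each group is weighted by the ratio of its share of the population to its share of the sample.

```python
from math import sqrt

# Hypothetical figures for illustration only; they are not Cattell's data.
# occupation: (share of the total population, test scores found in the sample)
SAMPLE = {
    "professional": (0.10, [128, 122, 131, 119]),
    "skilled": (0.50, [104, 98, 109, 101]),
    "labourer": (0.40, [91, 86]),
}


def weighted_mean_sd(sample):
    """Mean and S.D. after weighting each group to its population share."""
    n_total = sum(len(scores) for _, scores in sample.values())
    weights, values = [], []
    for population_share, scores in sample.values():
        sample_share = len(scores) / n_total
        weight = population_share / sample_share  # < 1 if over-represented
        for score in scores:
            weights.append(weight)
            values.append(score)
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, sqrt(variance)


all_scores = [s for _, scores in SAMPLE.values() for s in scores]
unweighted_mean = sum(all_scores) / len(all_scores)
weighted_mean, weighted_sd = weighted_mean_sd(SAMPLE)

print(f"Unweighted mean: {unweighted_mean:.1f}")
print(f"Weighted mean:   {weighted_mean:.1f} (S.D. {weighted_sd:.1f})")
```

In this made-up sample the professionals are over-represented, so the unweighted mean is too high; the weighting pulls it down towards the value to be expected in the population at large.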

The same difficulty in obtaining representative samples hinders us 
from tracing accurately the course of intellectual and educational 
growth from childhood, through adolescence, to adulthood. The 
available evidence, however, shows fairly definitely that this growth 
is not entirely regular, which means that M.A. and E.A. units are not 
equivalent throughout. The year by year increase of intelligence in 
an average person seems to be reasonably constant from about 3 to 
10 years, after which the rate of increase diminishes, and the M.A. 
units become progressively smaller, until a constant level is reached 
somewhere in the neighbourhood of 15 years. Probably this upper 
limit is reached at different ages on different tests, and by different 
individuals. When the growth curves of particular children have been 
followed through, they have shown considerable irregularities (Dear- 
born and Rothney, 1941). Average adolescents may show no further 
improvement after about 12 on a simple performance test but, pro- 
vided their education continues, they may go on advancing on 
complex verbal tests even up to 18 or 20. The upper limit for a very 
dull child is, of course, far lower; he may never surpass the per- 
formance of an average 10-year-old on any test. It is not yet certain 
whether he reaches this limit at the same C.A. as an average child 
reaches his, or somewhat earlier. A very bright child, on the other 
hand, naturally improves well above the normal 15-year level, and 
so we express his intelligence as M.A. 16, 17, . . . up to 22 or even
higher. But these units are entirely artificial. A 20-year M.A. does
not mean that a testee's performance is equal to that of an average
person aged 20, but only that his performance is a good deal better
than the performance of average persons aged anything from 15
years upwards. We must conclude, then, that M.A. units are so complex,
and likely to be so uneven above about 12 years, that it would
be far better to discard them altogether at such levels.

* Cf. Scottish Council for Research in Education (1933).

Lack of Equivalence of M.A. and E.A. Units.—Still less is known
about E.A. units. Possibly the increase is fairly steady from about 
6 to 10 or 11 years, but then there is a strong tendency in most 
schools to force children unduly, in order that they may pass the 
secondary school selection examinations. Thereafter some relaxa- 
tion occurs, at least on the formal subjects. Hence the 10- to 11-year 
unit is likely to be exceptionally large and the 12- to 13-year unit 
exceptionally small. The units may then tail off among those in 
modern schools, but among grammar school pupils and university 
students there is likely to be progress well into adulthood. Clearly 
then E.A. and M.A. units do not provide comparable scales much 
above 10 years. A child of 13 with an M.A. and an E.A. of 14 is not 
equally superior in intelligence and in education, since 14 minus 13
is a bigger quantity in the latter than in the former. Nor, of course, is 
this superiority of 1 year equivalent to the superiority of a child of 7 
whose M.A. and E.A. are 8 years. 

Not long after Binet put forward the M.A. system of scaling in- 
telligence, the discovery was made that a child's relative advance- 
ment or retardation do not remain constant, but increase as his age 
increases. Burt (1917) proved that the same was true of E.A. A child 
of 5 with an M.A. or E.A. of 6 will usually show an M.A. or E.A. of
12, not of 11, when he reaches C.A. 10. In other words, a superiority
of a year in a young child is much greater than the same superiority
in an older child. But the ratio of M.A. to C.A., or E.A. to C.A.,
does seem to remain approximately constant, hence the I.Q. and
E.Q. have been very widely adopted as measures of intellectual and
educational advancement and retardation.
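
The constancy of the ratio can be illustrated by a few lines of Python. The function names are hypothetical; the quotient is taken, in the conventional way, as 100 times the Mental (or Educational) Age divided by the Chronological Age.

```python
def quotient(test_age: float, chronological_age: float) -> float:
    """I.Q. or E.Q.: 100 x (M.A. or E.A.) / C.A."""
    return 100 * test_age / chronological_age


def projected_test_age(quotient_value: float, chronological_age: float) -> float:
    """M.A. or E.A. to be expected at a later C.A. if the quotient stays constant."""
    return quotient_value * chronological_age / 100


iq_at_5 = quotient(test_age=6, chronological_age=5)
ma_at_10 = projected_test_age(iq_at_5, chronological_age=10)

print(f"Child of 5 with M.A. 6: I.Q. = {iq_at_5:.0f}")
print(f"Same I.Q. at C.A. 10:   M.A. = {ma_at_10:.0f}")
```

An advancement of one year at age 5 thus corresponds to an advancement of two years at age 10, which is the observation that led to the adoption of the quotient.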

Ambiguity in I.Q. Units.—There is yet another flaw both in M.A.
and I.Q. and in the corresponding educational units, namely that the
range and S.D. of scores expressed in these units are wider for some
tests than for others. For example, the application of educational
tests to a class of 9-year-olds might show that their average E.A. was
9, and that 20% of them had E.A.s of 10 or more. But the application
of a group intelligence test to the same pupils might show an average
M.A. of 9 and as many as 30% with M.A.s of 10 or more. If this was
so, then clearly an advancement of 1 year, or an I.Q. or E.Q. of 111
(i.e. 10/9 × 100), does not mean the same thing on different tests. Some
intelligence tests yield I.Q.s with a S.D. of 15 or even less, others 
yield I.Q.s with a S.D. as high as 25 or more. Probably educational 
tests show similar irregularities. The S.D. of the old Stanford-Binet 
test I.Q.s was established as about 15, and many modern tests such as 
the Moray House group tests and the Wechsler Intelligence Scales 
for Children and Adults have adopted the same figure as a convenient
one. Terman-Merrill I.Q.s tend to be more widely dispersed,
their average S.D. being quoted as 16½. But it varies from 12½ to 20
at different age levels. This means that I.Q.s (with the same mean of
100) have different meanings from one test to another, or on the same
test at different ages. An ordinary school class will commonly obtain
a range of I.Q.s from about + 2σ to − 2σ, i.e. from 130 to 70 on a
test with a S.D. of 15. But the same class tested with a group test 
whose S.D. is 25 will range in I.Q. from 150 to 50. To take another 
example: a class of very bright children may have an average I.Q. of 
115 on a Moray House test. The same class tested with another group 
test may have an average I.Q. of 125. The following table illustrates 
the proportions to be expected in the total population with I.Q.s 125 
or more, or 75 or less, on tests with various S.D.s. 


              % I.Q. ≥ 125
    S.D.      or ≤ 75
    12½       2½
    15        5
    16½       6½
    20        11
    25        16
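
Proportions of this kind can be computed directly from the normal curve. The sketch below assumes that I.Q.s on each test are normally distributed about a mean of 100; the function name `percent_at_or_beyond` is introduced for illustration.

```python
from statistics import NormalDist


def percent_at_or_beyond(iq: float, sd: float, mean: float = 100) -> float:
    """Percentage of the population at or beyond `iq`, on the side away from the mean."""
    dist = NormalDist(mean, sd)
    if iq >= mean:
        return 100 * (1 - dist.cdf(iq))
    return 100 * dist.cdf(iq)


for sd in (12.5, 15, 16.5, 20, 25):
    high = percent_at_or_beyond(125, sd)
    low = percent_at_or_beyond(75, sd)
    print(f"S.D. {sd:4.1f}: {high:4.1f}% at 125 or more, {low:4.1f}% at 75 or less")
```

Because the curve is symmetrical, the percentage at 125 or more is the same as that at 75 or less for any given S.D.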


Unfortunately it is impossible to assert that any particular value
of the S.D. of I.Q.s is the ‘right’ one, and that the other values are
too large or too small. The value depends on certain features of the
test which are still somewhat obscure, and are too complex for
discussion here.² The figure 15 is implied when we talk of 70 as the
border-line for mental deficiency, since most of the testing of
border-line cases has been carried out with the Stanford-Binet test whose
S.D. kept fairly close to 15. The only satisfactory solution to this
problem is to give up the Mental and Educational Age system of
calculating I.Q.s and E.Q.s, and to establish separate percentile or
standard-score norms at each age level for which the test is suitable.
The conventional form of I.Q. has been, and is still being, used far
too loosely, as though it represented a standard scale of measurement,
independent of the particular test employed. Actually, like
any other test score, its interpretation is possible only if its S.D. (as
well as the mean of 100) is taken into account. The derivation of
standard scores may be somewhat more difficult for untrained
testers to grasp. But such scores do have a uniform significance with
different tests at different ages, and, when the S.D. of 15 is adopted,
they do provide a scale which is very closely comparable to the
older, generally accepted, scale of I.Q.s.

2 The main factor seems to be the degree of heterogeneity among the items or
subtests. The more homogeneous the test material, the higher will be the S.D.
of I.Q.s. Hence tests composed mainly of verbal reasoning items like Cattell's
Scales II and III tend to have higher S.D.s than more miscellaneous tests like
Stanford-Binet. But in addition certain types of items seem to be intrinsically
more 'spreading' than others.


CHAPTER V
RELIABILITY AND THE THEORY OF SAMPLING 


Untrustworthiness of Results obtained from Small Numbers of Persons.—
This chapter is likely, we must admit, to cause considerable difficulty to
readers unfamiliar with statistical concepts. It introduces a point of view
with regard to educational measurements and experiments which may
at first seem very complex and involved, yet which is absolutely
essential to a proper understanding of scientific education and psychology.¹

The scientific investigator naturally wishes to arrive at results
which are true for people in general. He wants to prove that a certain
method of teaching gives superior results with all children, that the
correlation² between intelligence tests or school examinations and
subsequent scholastic achievement always reaches a certain figure,
or he wants to know the average test score of all persons brought up
in a particular environment, e.g. of rural as contrasted with urban
children, or Scottish as contrasted with English, and so on. But he
can seldom, if ever, measure all the persons to whom his conclusions
are supposed to apply. He is forced to work with samples of the
population, and he is therefore entitled to regard his result (whether
it be an average, or a difference between two averages, or a correlation
coefficient, etc.) only as a sample result, which might alter to some
extent if he repeated the work on other samples of the population.
He must, of course, take precautions in the first place to see that
his sample is not what we call a biased one. Supposing his problem
is the securing of norms for a group intelligence test, and he wishes
to find the average and the dispersion of scores among children aged
11+, he must, as we saw in the previous chapter, avoid choosing
only schools in poor areas or only private schools. But even when
such bias in selection is eliminated, he would still have to content
himself with a limited sample of the children aged 11+ in the
country as a whole, and the average of this sample might differ to
some extent from the true average. Similarly if his problem was the
comparison of the intelligence of children in suburban and industrial
areas, he could not measure all the schools of these types. Thus the
difference which he obtained would only be an approximation to the
difference that might result from much more complete investigations.

1 A particularly clear explanation has been given by Thouless (1939b).

2 The reader who is unfamiliar with the conception of correlation may be
advised to study pp. 92-5 of the next chapter before continuing with this
chapter.


Here are two concrete instances. 

Ex. 27.—Twelve of the writer's training college students were 
given an American objective test of reading ability, namely the 
Nelson-Denny test, which measures knowledge of word meanings 
and comprehension of paragraphs. Their scores, arranged in order, 
were: 140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82. 

Now, most of these scores are well above the average for American 
college students, namely 96. Their average, 113.6, falls at the 78th
American percentile. But obviously the scores are very variable. 
The question is, then, are we or are we not justified in deducing from 
them that Scottish training college students possess higher reading 
ability than American college students? Obviously no such deduc- 
tion could be made from the score of only a single student, but does 
the average of a sample of 12 provide a much surer basis? The 
12 were picked at random from among the writer's 264 students. 
Thus he was careful to avoid selecting only those, or a prepon- 
derance of those, who had an Honours degree in English. And men 
and women were equally represented, since it was conceivable that a 
sample composed only of women or only of men would be biased. 
Yet it is quite possible that he unintentionally picked an undue 
number of clever students, and that if he picked another dozen, 
their scores might be mostly below 100, and their average no higher, 
or even lower, than the American figure. How liable then is such 
an average of twelve measures to vary?

The question can be answered in this instance, since the same test
was given to 21 other sets of 12 students, also chosen at random.
The 22 averages range from 103.6 to 126.8, their distribution being
shown in the following table. As we suspected, the averages are
rather variable, though less so than the separate scores, which range
all the way from 62 to 160. And though our first average of 113.6
happens to be fairly close to the grand average of 114.9, yet we might,
in the luck of the draw, have obtained an average as much as 10
points above or below this figure.

Average Mark    Frequency
   124 +            1
   121 +            3
   118 +            2
   115 +            3
   112 +            8
   109 +            2
   106 +            2
   103 +            1
                   --
                   22
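
The behaviour shown in the table can be imitated by drawing many artificial samples of twelve. The sketch below is only an illustration under stated assumptions: it does not use the writer's actual 264 scores, but supposes a normal population whose mean (114.9) and S.D. (20.15) are the figures quoted for those students; the names `sample_mean` and `sd_of_means` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is repeatable

POP_MEAN, POP_SD = 114.9, 20.15   # assumed population values (figures quoted for the 264 students)
SAMPLE_SIZE = 12

def sample_mean(n: int) -> float:
    """Mean of n scores drawn at random from the assumed normal population."""
    return statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(n))

# Draw a large number of samples of twelve and examine how their means scatter.
means = [sample_mean(SAMPLE_SIZE) for _ in range(20_000)]
sd_of_means = statistics.pstdev(means)

print(f"S.D. of the sample means: {sd_of_means:.2f}")
print(f"S.D. of scores / sqrt(N): {POP_SD / SAMPLE_SIZE ** 0.5:.2f}")
```

The two printed figures should agree closely, which anticipates the formula for the S.E. of a mean given later in the chapter.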




Ex. 28.—The correlation coefficient was calculated between the 
marks of 25 students on an examination given near the beginning of 
the year, and their marks on another paper at the end of the year. 
The coefficient was +0.53. But 25 students is only a very small
sample, and similar results calculated for 13 other similar samples 
varied widely, as shown in the following table. 


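[Table: frequency distribution of the correlation coefficients
obtained from the 14 samples. The individual entries are illegible
in this copy; only the total, 14, survives.]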

Thus the original figure is far from trustworthy, and even the
average coefficient of +0.315, for 350 students, though probably
much more reliable, might also alter to some extent if the same 
examinations were compared among other similar large groups of 
students. 

The Scientific Standpoint.—The scientific investigator, therefore, 
always looks on any numerical result as a sample one which may 
deviate more or less from the truth. Even when he has excluded errors 
of selection such as would make the persons he is measuring in any 
way unrepresentative, he has to consider the probable extent of these 
so-called chance errors of sampling. He therefore goes on to assume 
that it is theoretically, though perhaps not practically, possible to 
obtain a very large number of other such samples. The two tables 
given above are instances on a small scale. They both bring out a
vital step in our argument, namely that such additional samples tend 
to conform to normal distributions. The numbers involved are small, 
hence these distributions are irregular, but there exists ample evi- 
dence to show that larger numbers would eventually yield smooth 
normal curves. 

Let us take a fresh instance, where only a single sample result is 
available, in order to illustrate the further stages of the argument. 

Ex. 29.—The average scores for 136 men and 114 women students 
on the Cattell Intelligence Test III (cf. Ex. 3, p. 10) were 101.55 and
99.65 respectively. Thus the men were 1.90 points superior.


Can we deduce that men students, of the type tested, are in general 
better than women, or would we, if we applied the same tests to
many other similar groups, find the difference to be larger in some 
cases, negligibly small in other cases, or even reversed in others? 
This difference, which we shall call d, is only a sample one, which 
should be regarded as falling at some point on the distribution of a 
large number of possible alternative d's. Clearly the best possible 
estimate of the true difference between men and women would be a 
grand average of all these hypothetical d’s. This quantity—the mean 
of the distribution of d's—we shall call D. The particular d which we 
know is a more or less accurate approximation to D. 

A formula has been worked out by means of which it is possible
to predict, from the actual data available, what would be σ_d, that is
the S.D. of this distribution of d's. The formula will be given later.
In the present instance it yields σ_d = 1.41. As the distribution is a
normal one, this figure will enable us to define its limits. We can state
that about one-third of the d's must fall between D and D + 1.41,
another third between D and D − 1.41, and that about one-third of
them must deviate more widely from D. Scarcely any of them, or
only 0.135%, will be as large as D + 4.23 (i.e. 3σ_d), and only 0.135%
will be as small as D − 4.23. Hence 99.73% (i.e. 100 − 2 × 0.135)
of the d's lie within the limits D ± 4.23, and the odds against any
d lying outside these limits are 99.73 to 0.27, or 370 to 1. As a general
rule, the probability is 370 to 1 that any d which deviates from D
purely on account of chance errors of sampling will not do so by
more than 3σ_d.

Certain other probabilities have been deduced from the probability
integral graph (Fig. 18), and are listed in the following table:

Limits           Frequencies    Odds against d's    P value
                 of d's         falling outside
                                limits
± 0.6745 σ_d     2 × 25.0%      1 to 1              .50
± 1.00 σ_d       2 × 15.8%      Nearly 2 to 1       .32
± 1.645 σ_d      2 × 5.0%       9 to 1              .10
± 1.96 σ_d       2 × 2.5%       20 to 1             .05
± 2.58 σ_d       2 × 0.5%       100 to 1            .01
± 3.29 σ_d       2 × 0.05%      1,000 to 1          .001

Now these predictions must apply to the particular d which we
happen to know, whose value is 1.90. The odds are very much against
its deviating as much as 3.29σ_d from D, but it might quite possibly
deviate as much as, say, 1.96σ_d, since the odds against this are much
lower. We can therefore, as it were, reverse the argument, and state
with practical certainty that D must lie between 1.90 ± 3.29σ_d, i.e.
between +6.54 and −2.74. We do not know, of course, what is the
value of D, but if it fell outside these limits, our d would be more
than 3.29σ_d removed from it, and the odds against this are 1,000 to 1.
Similarly the odds are 100 to 1 that it lies between 1.90 ± 2.58σ_d
= +5.54 and −1.74; and 20 to 1 that it lies between 1.90 ± 1.96σ_d
= +4.66 and −0.86. These latter limits are commonly referred to
as the .01 and .05 confidence limits, respectively, of our obtained d.
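
The arithmetic of these limits is easily mechanised. The sketch below simply takes the obtained difference and its S.E. from Ex. 29 and adds or subtracts the appropriate multiple; the function name `confidence_limits` is a hypothetical one chosen for illustration.

```python
def confidence_limits(d: float, se: float, z: float) -> tuple[float, float]:
    """Lower and upper limits, d - z*se and d + z*se."""
    return d - z * se, d + z * se

D_OBS, SE_D = 1.90, 1.41   # obtained difference and its S.E. (Ex. 29)

for level, z in ((".05", 1.96), (".01", 2.58), (".001", 3.29)):
    lower, upper = confidence_limits(D_OBS, SE_D, z)
    print(f"{level} limits: {lower:+.2f} to {upper:+.2f}")
```

All three pairs of limits include zero, which is why the difference cannot be accepted as trustworthy.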

The Standard Error as an Index of the Trustworthiness of a Result.— 
It may be seen then that σ_d provides us with an index of the margin
of uncertainty in our d. It tells us that the difference of 1.90 in favour
of men students is by no means trustworthy, since the true value of 
D may be considerably larger or smaller. 

This is the manner in which the scientific investigator always thinks 
of the results of psychological or educational experiments. His ob- 
tained result, being only a sample based on a limited number of cases, 
is likely to deviate to a certain extent from the true result which he 
would obtain if he could experiment on a very much larger set of 
people. He can, however, work out the σ_d, often referred to as the
Standard Error or S.E. of his result, which tells him, not precisely 
how much his result is in error, but what is the probability that his 
result is in error by various amounts. There is a 2 to 1 chance that the 
error may be no bigger than the S.E., and the odds against the
occurrence of a larger error rise very rapidly, so that it is extremely
unlikely to be greater than three times the S.E.

The Probable Error.—The figures at the head of the table, above,
show that there is an even chance of D lying within, or lying outside,
the limits ± 0.6745σ_d. This particular fraction of the S.E. is known
as the Probable Error or P.E. The name is not a very good one, but
in a sense it is the most probable error, since the odds against either
larger or smaller errors than this are higher than 1:1. In our example,
the P.E. = 0.6745 × 1.41 = 0.95. This means that D is just as likely
to lie between 1.90 + 0.95 and 1.90 − 0.95 as it is to lie outside these
limits.

The P.E. is frequently used as an index of the unreliability or
margin of uncertainty, in preference to the S.E. Instead of saying
that the odds against D deviating from d by 3.29 × S.E. are 1,000
to 1, we can say that a deviation of 5 × P.E. possesses this degree
of unlikelihood, since 5 × 0.6745 is approximately equal to 3.29.
Or if ± 1.96 × S.E. are regarded as the limits within which D may
reasonably be expected to lie, we can alternatively denote these limits
as ± 3 × P.E. Often a numerical result such as our d of 1.90 is
stated in the form: 1.90 ± 0.95, where the latter figure is the P.E. of
the former. While both the S.E. and the P.E. indicate the likely
amount of error in some numerical result, the S.E. is, as we have
seen, primarily a measure of the dispersion of all the possible sample
results. The same is true of the P.E. In the hypothetical distribution
of sample d's, the limits D + P.E. to D − P.E. enclose 50% of these
d's, and 25% of them are greater than D + P.E., 25% less than
D − P.E. Note therefore that the P.E. is identical with Q, the
semi-interquartile range, of this distribution. For the limits Q1 to Q3
also mark off the middle 50% of measures in a normal distribution.

The Statistical Significance of a Difference.—Many experiments
have shown that the scores of comparable groups of men and women
on intelligence tests are practically identical, or that any differences
found are not reliable. Since, in our example, the odds are only 21
to 1 against D being greater than 4.72 or less than −0.92, it seems
quite possible that here too the d is not reliable, and that the true
value of D is zero. If this be so, then it is easy to determine from the
probability integral graph the chances that a sample might show a d
of +1.90, merely through errors of sampling. 1.90 is 1.35 × 1.41,
or 1.35σ_d. Consulting the graph (Fig. 18), we see that 9% of measures
in a normal distribution lie at or beyond 1.35σ from the mean. Hence
when D is zero, 9% of the sample d's amount to +1.90 or more,
and 41% lie between 0 and +1.90. Thus the odds against a d of
1.90 being due to errors of sampling are only 41 to 9, or about 4½ to
1.¹ In other words, it might occur as a matter of chance once in four
or five testings of similar samples of students. Such odds are far too
low for us to put any trust in the conclusion that men students are
better than women at this test. Had d been 3.63 (i.e. 2.58σ_d) we might
have felt reasonably certain, since this difference could arise by
chance only once in a hundred times. It would then be referred to as
statistically significant at the .01 level. Note that ‘significant’ does
not necessarily imply ‘meaningful.’ We are not concerned here with
the psychological interpretation of any difference. Statistically
significant refers purely to the reliability or trustworthiness of the figures.
A difference between the results of two groups, divided by its S.E.,
is often called the Critical Ratio or C.R. In this example the C.R. is
only 1.35. It is customary to place little reliance in a difference whose
C.R. is less than 2 or preferably 3. Alternatively a d is regarded as
significantly greater than zero if it exceeds 3, or preferably 5, times
its P.E. But the border-lines are quite arbitrary matters of
convention; we are certainly not justified in rejecting all smaller differences
because they are conventionally non-significant. Suppose the C.R.
was 1.645 (d = 1.19), with a probability of 1 in 10. Though this
might easily arise by chance, it does not prove that the true value of
D is zero. Indeed, according to the .01 confidence limits, the true
value might lie anywhere between +4.82 and −2.44. Such a critical
ratio is said to be of borderline significance, since it suggests that a
true positive difference might be slightly more likely to occur than a
zero or a negative one if we were able to investigate further samples.

1 The acute reader may well ask here: why not contrast the 9% of sample d's
of +1.90 and over with the 91% of smaller and of negative d's, thus arriving
at odds of about 10 to 1? This procedure is denoted as the one-tailed test, since
it is based on the top end or tail of the normal distribution of d's; whereas the
test described in the text is the two-tailed. Many statisticians hold that the
one-tailed test should be used when it is desired to show that a difference is as
large or larger than a certain amount, whereas the two-tailed test is appropriate
when attempting to prove that the difference lies outside certain limits in either
direction, positive or negative. Others, however, prefer to use the two-tailed
test under almost all circumstances, since it provides a more conservative
estimate of the trustworthiness of their results. By following this practice we are,
in effect, implying that the true sex difference might be in favour of men or
women, rather than that the true difference might be in favour of men or else
be zero.

Another note of caution is necessary. Even if d had been large
enough in the above example (relative to its S.E. or P.E.) to be
accepted as significant, the conclusion that men are somewhat
superior at the test would have applied only to the type of students
tested, namely Scottish university graduates who had chosen the
teaching career. It could not be extended to English training-college
students, nor to other university students, far less to men and women
in general, since the persons actually tested could not be regarded as
typical or representative of these wider groups. Again it would be
safer not to claim that such a difference was a difference in
intelligence, but only in the ability which is measured by this particular
group intelligence test. The statistical treatment which has been
applied to the distributions of scores only informs us how sound and
reliable are the results in respect of other similar testees, i.e. it takes
care of sampling errors, but it does not eliminate errors or bias in
selection. As mentioned in Chap. I, statistical methods cannot
improve data which are in any way inaccurate or defective; they merely
bring out the significance (or lack of significance) of the data actually
obtained.
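
The steps of this section (critical ratio, proportion of the curve beyond it, and the resulting odds) may be brought together in a short calculation. This is a sketch only, using the figures of Ex. 29 and the normal distribution in Python's standard library; the names `critical_ratio` and `upper_tail` are hypothetical.

```python
from statistics import NormalDist

def critical_ratio(d: float, se: float) -> float:
    """A difference divided by its standard error."""
    return d / se

cr = critical_ratio(1.90, 1.41)            # figures from Ex. 29
upper_tail = 1 - NormalDist().cdf(cr)      # chance of a d this large or larger when D is zero

print(f"C.R. = {cr:.2f}")
print(f"proportion of d's at or beyond +1.90: {upper_tail:.3f}")
print(f"proportion beyond 1.90 in either direction: {2 * upper_tail:.3f}")
# The comparison used in the text: the d's between 0 and +1.90 against those beyond +1.90.
print(f"odds, as reckoned in the text: {(0.5 - upper_tail) / upper_tail:.1f} to 1")
```

The calculated proportion is a little under 9%, so the odds come out slightly higher than the round figures read from the graph.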

FORMULAE FOR THE STANDARD AND PROBABLE ERRORS OF
NUMERICAL RESULTS

Common sense tells us that a numerical result such as an average,
or a correlation, or a difference between two averages, will always be
more reliable the greater the number of cases on which it is based.
In Ex. 27 (p. 75) it was shown that the averages obtained by sets of
twelve students on a test of reading ability were exceedingly variable.
Obviously twelve is far too small a sample upon which to base
conclusions. In order to make a trustworthy comparison between the
performance of Scottish and American students at this test, we
should need much larger numbers. Most of the formulae for
reliability involve the quantity 1/√N, which means that by quadrupling
the size of a sample, we can halve the variability of any result
computed from this sample, or double its reliability. Another important
factor which governs the reliability of a result is the variability among
the original measures. Suppose that, in Ex. 29, all the men students
had scored 101 or 102 (average 101.55), and all the women had
scored 99 or 100 (average 99.65), we should certainly have been
entitled to claim that men students do better at the test than women.
But it is because the men's scores range all the way from 73 to 134,
and the women from 78 to 129, that we felt suspicious of the difference
in averages, and applied statistical treatment to it. There was so much
overlapping that about 42% of the women obtained higher scores
than the average man, and about 42% of the men scored lower than
the average woman. Hence the reliability formulae also involve some
measure of the dispersion or extent of variation of the original
measures.

The S.E. of a Difference.—This S.E. is given by the formula:

$$\sigma_d = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}$$

Here σ_1 and σ_2 are the S.D.s of the scores in the two groups, N_1 and
N_2 are their numbers. The P.E. of a difference is 0.6745 times this
quantity.


Ex. 30.—As a fresh illustration, let us compare the scores on the 
Nelson-Denny reading test of students having Honours degrees in 
Arts with the scores of students having Honours in Science or 
ordinary degrees. The essential data are as follows: 




                           N        M         σ
Honours Arts students     66     129.47     16.42
Other students           198     109.97     18.90

The difference between the two averages is 19.50.

$$\sigma_d = \sqrt{\frac{16.42^2}{66} + \frac{18.90^2}{198}} = \sqrt{4.088 + 1.804} = 2.427$$

Here the critical ratio is 19.50/2.427, or approximately 8. Such a difference
could not possibly occur through chance errors of sampling. Hence
we may conclude that Honours Arts students, of the type tested, are
definitely superior to other graduate students at this test.
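
The formula for the S.E. of a difference between two independent averages can be sketched as a small function and applied to the figures of Ex. 30. The function name `se_difference` is hypothetical; the S.D.s and numbers are those quoted in the text.

```python
import math

def se_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """S.E. of the difference between the means of two independent groups."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Ex. 30: Honours Arts students against other students.
se = se_difference(16.42, 66, 18.90, 198)
cr = (129.47 - 109.97) / se
print(f"S.E. of difference = {se:.3f}, critical ratio = {cr:.1f}")

# Ex. 29: men and women on the intelligence test (S.D.s 12.36 and 9.95).
print(f"Ex. 29 S.E. = {se_difference(12.36, 136, 9.95, 114):.2f}")
```

The second call recovers the value of 1.41 that was used, without derivation, throughout the discussion of Ex. 29.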


S.E. of a Difference between Averages of Scores which are
Intercorrelated.—When one group of persons takes two tests, or if they
repeat a test, and we wish to examine the statistical significance of
the difference between their two average scores, we must use a
modified formula.¹ This applies whenever there exists a correlation,
r, between the two sets of scores.

$$\sigma_d = \sqrt{\frac{1}{N}\left(\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2\right)}$$

Ex. 31.—Marks in English and arithmetic were awarded to a class
of 57 children. These are tabulated in Ex. 23, p. 57. Apparently the
arithmetic marks run higher than the English ones. Is the difference
significant? The figures are as follows:

                Mean      Difference      σ       N
English        67.615                   12.94
Arithmetic     70.510       2.895       14.64     57

The correlation between the two sets of marks is indicated by a
coefficient of +0.542.

$$\sigma_d = \sqrt{\frac{1}{57}\left(12.94^2 + 14.64^2 - 2 \times 0.542 \times 12.94 \times 14.64\right)} = 1.761$$

The difference of 2.895 is only 1.64 times its S.E., hence it is not
significant.
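
The effect of the correlation term can be seen by working Ex. 31 both with and without it. The following is a sketch under the assumption that the formula divides by N, as the figures of the example imply; the function name `se_difference_correlated` is hypothetical. Recomputed from the rounded S.D.s and r, the S.E. comes to about 1.76, which differs from the printed 1.761 only in the third decimal place.

```python
import math

def se_difference_correlated(sd1: float, sd2: float, r: float, n: int) -> float:
    """S.E. of the difference between two means obtained from the same n persons."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2) / n)

# Ex. 31: English and arithmetic marks of 57 children, r = +0.542.
se = se_difference_correlated(12.94, 14.64, 0.542, 57)
cr = 2.895 / se
print(f"S.E. = {se:.3f}, critical ratio = {cr:.2f}")

# Setting r to zero shows how much larger the S.E. would appear if the
# correlation between the two sets of marks were ignored.
print(f"S.E. ignoring correlation = {se_difference_correlated(12.94, 14.64, 0.0, 57):.3f}")
```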


1 An alternative version of the formula, which obviates calculating the
correlation coefficient, is given below in the section on small samples (p. 85).

S.E. of a Mean.—An average is, as we have seen in Ex. 27, quite
as likely to vary through errors of sampling as a difference between
averages. Hence it is very useful to know how near to the true
average of a very large group is likely to be the average obtained
from a group of limited size. The S.E. of the mean is often
symbolized by σ_M, and is given by the formula:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

Here σ is, of course, the S.D. of the original scores, whose average has been
computed.


Ex. 32.—The twelve scores on the Nelson-Denny test, listed in
Ex. 27, have a mean of 113.58 and S.D. of 20.07.

$$\sigma_M = \frac{20.07}{\sqrt{12}} = 5.794$$

The P.E. of M is 0.6745 × S.E. = 3.908.

Thus the true mean for students of the type tested may just as well
lie outside the limits 113.58 ± 3.91 (i.e. 117.49 to 109.67) as within
them. It may be as much as 2.58σ_M away from 113.58, i.e. as high as
128.53 or as low as 98.63, but it is unlikely to exceed these .01
confidence limits. Obviously the figure is extremely unreliable.

By contrast the mean for 264 students was 114.9 and the S.D.
20.15. Here σ_M = 20.15/√264 = 1.24. The sample is 22 times as large,
hence σ_M is 1/√22 of its previous value. The P.E._M = 0.836. From these
figures it follows that there is an even chance of the true mean lying
between 115.7 and 114.1, and that it almost certainly does not lie
outside the limits 118.1 and 111.7.
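
The figures of Ex. 32 can be reproduced directly from the twelve scores listed in Ex. 27. The sketch below assumes, as the printed result implies, that the S.D. is computed with N (not N − 1) in the denominator; the variable names are hypothetical.

```python
import math
import statistics

scores = [140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82]   # Ex. 27

mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)              # S.D. with N in the denominator
se_mean = sd / math.sqrt(len(scores))       # S.E. of the mean
pe_mean = 0.6745 * se_mean                  # P.E. of the mean

print(f"mean = {mean:.2f}, S.D. = {sd:.2f}")
print(f"S.E. of mean = {se_mean:.3f}, P.E. of mean = {pe_mean:.3f}")

# The larger group of 264 students, with S.D. 20.15.
print(f"S.E. of mean for N = 264: {20.15 / math.sqrt(264):.2f}")
```

Comparing the two S.E.s shows the gain in trustworthiness that comes from the larger sample.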


S.E. of a Standard Deviation and of a Difference between S.D.s.—
Not only the mean but also the S.D. of a set of measures may be
affected by errors of sampling. The formula for the S.E. of a S.D. is:

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$

The S.E. of a difference between two S.D.s is given by:

$$\sigma_d = \sqrt{\frac{\sigma_1^2}{2N_1} + \frac{\sigma_2^2}{2N_2}}$$

Ex. 33.—The S.D.s of the intelligence test scores of men and
women students were 12.36 and 9.95 respectively, indicating that
the dispersion of ability was greater among the men. The S.E. of the
difference of 2.41 between these two figures is:

$$\sigma_d = \sqrt{\frac{12.36^2}{2 \times 136} + \frac{9.95^2}{2 \times 114}} = 0.998$$

The difference is 2.42 times its S.E., and this shows a fair degree of
statistical significance. The odds against such a difference occurring
through chance errors are 49.23 to 0.77, or 64 to 1. The result
therefore allows a reasonably strong presumption that men students
of the type tested are in general more spread out in their ability at
an intelligence test than are women students.
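
Ex. 33 may be checked in the same manner. The functions below are a sketch of the two formulae for S.D.s, with hypothetical names; the figures are those of the example.

```python
import math

def se_of_sd(sd: float, n: int) -> float:
    """S.E. of a standard deviation based on n cases."""
    return sd / math.sqrt(2 * n)

def se_sd_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """S.E. of the difference between the S.D.s of two independent groups."""
    return math.sqrt(se_of_sd(sd1, n1) ** 2 + se_of_sd(sd2, n2) ** 2)

# Ex. 33: 136 men (S.D. 12.36) and 114 women (S.D. 9.95).
se = se_sd_difference(12.36, 136, 9.95, 114)
cr = (12.36 - 9.95) / se
print(f"S.E. of difference between S.D.s = {se:.3f}")
print(f"critical ratio = {cr:.3f}")
```

Writing the second function in terms of the first makes plain that the S.E. of the difference is built from the two separate S.E.s, squared and added.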


S.E. of a Percentage and of a Difference between Percentages.—
Often it is not possible to measure the amounts of some quality in a
group of persons, but only to state what percentage of the persons
possess or do not possess it. Suppose that in several schools 60% of
the boys and 65% of the girls regularly take milk, does this indicate a
reliable sex difference? Again the trustworthiness of the result depends
on the number of cases. If p is the percentage, and q = 100 − p,
then the S.E. of p is √(pq/N), and the S.E. of a difference between two
percentages is:

$$\sigma_d = \sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}}$$

Ex. 34.—From how many children would it be necessary to obtain
milk-drinking records in order to be practically certain that the
difference of 65% − 60% is reliable?

This example is posed in the reverse form from Exs. 30-33, but the
same method is employed. If a difference of 5% is reliable, it must
be about 2½ times its S.E. That is, the S.E. must not be greater than
2. Let the number of both boys and girls in the schools where records
are obtained be N. Then:

$$2 = \sqrt{\frac{60 \times 40}{N} + \frac{65 \times 35}{N}}, \qquad \therefore N = 1{,}169$$


The answer is, then, that the sex difference of 5% would be signi- 
ficant if based on some 1,200-odd boys and an equal number of girls, 
but that it would not be so if the numbers were much smaller. 
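
The reverse calculation of Ex. 34 can be sketched as a function which, given two percentages and the critical ratio required, returns the number needed in each group. The name `n_per_group` is hypothetical, and equal numbers of boys and girls are assumed, as in the example.

```python
import math

def n_per_group(p1: float, p2: float, critical_ratio: float) -> int:
    """Smallest equal group size at which |p1 - p2| reaches the given critical ratio."""
    max_se = abs(p1 - p2) / critical_ratio            # largest S.E. that can be tolerated
    pq_sum = p1 * (100 - p1) + p2 * (100 - p2)        # p1*q1 + p2*q2, in percentage units
    return math.ceil(pq_sum / max_se ** 2)

print(n_per_group(60, 65, 2.5))    # boys 60%, girls 65%, difference to be 2.5 times its S.E.
```

Relaxing the requirement to a critical ratio of 2 would reduce the number needed, since the number required rises with the square of the critical ratio.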


One of the commonest faults in writings about educational topics 
is the quotation of percentage results without any indication of the
reliability of these results, often without even any mention of the 
actual numbers of cases from which the P.E.s could be calculated. 

It should be noted that all the formulae quoted above for the S.E.
of a difference (between means, between S.D.s, and between
percentages) are derived from one and the same formula:

$$\sigma_d = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$$

where σ_1 and σ_2 are the S.E.s of the means, S.D.s, and percentages
respectively (not the S.D.s of the original scores). The third term,
2rσ_1σ_2, applies whenever there is any correlation between the original
measures from which the means, S.D.s, or percentages are derived.
But when, as in most of our examples above, there is no such
correlation, the term vanishes.

Statistical Treatment of Small Samples.—It has been shown by
Fisher (1930) and others that numerical results obtained from very
small samples are even less reliable than the statistical treatment
described above would lead us to expect. Under such circumstances
it is better to calculate a S.E. by substituting N − 1 for N in the
denominator.

The resulting critical ratios are denoted by the symbol t, and
special tables are available which show the probability of various
values of t for different numbers of cases.¹ For a difference between
the means of two groups, the proper formula is:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\sum x_1^2 + \sum x_2^2}{N_1 + N_2 - 2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$

Σx_1² and Σx_2² are the sums of squared deviations from the
respective means. If calculated from raw scores, or from arbitrary means,
the corrections described on pp. 46-7 must be inserted.

For a difference between two sets of scores in the same group, it
is simplest to tabulate the actual differences, d, between each person's
two scores (allowing for signs). Then:

$$t = \frac{\sum d / N}{\sqrt{\dfrac{\sum d^2 - (\sum d)^2 / N}{N(N-1)}}}$$

This automatically allows for the correlation between the two
variables.

Note that this modified treatment, though statistically preferable,
makes very little difference until the N's fall below about 15. The
enormous majority of educational experiments are carried out on
much larger numbers.
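
The two small-sample procedures may be sketched as follows. The marks are invented purely for illustration and are not taken from any of the writer's examples; the function names `t_independent` and `t_paired` are hypothetical, and the formulae are the usual pooled and paired forms of t.

```python
import math
from statistics import fmean

def t_independent(a: list[float], b: list[float]) -> float:
    """t for the difference between the means of two separate groups."""
    mean_a, mean_b = fmean(a), fmean(b)
    ss_a = sum((x - mean_a) ** 2 for x in a)     # squared deviations from own mean
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_variance = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_variance * (1 / len(a) + 1 / len(b)))

def t_paired(first: list[float], second: list[float]) -> float:
    """t for the mean difference between two scores obtained by the same persons."""
    d = [x - y for x, y in zip(first, second)]   # differences, with signs
    n = len(d)
    ss_d = sum(v * v for v in d) - sum(d) ** 2 / n
    return (sum(d) / n) / math.sqrt(ss_d / (n * (n - 1)))

# Hypothetical marks of five pupils on two occasions (invented data).
first_test = [5, 7, 6, 9, 8]
second_test = [4, 5, 6, 7, 5]

print(f"treated as two separate groups: t = {t_independent(first_test, second_test):.3f}")
print(f"treated as paired scores:       t = {t_paired(first_test, second_test):.3f}")
```

With these invented marks the paired treatment gives the larger t, because the differences between the two scores of each person vary less than the scores themselves.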

1 Cf. Fisher and Yates (1938). Instead of the actual numbers of cases, the
tables refer to what are called degrees of freedom, which are normally N − 1.
Thus for a difference between 13 boys and 10 girls, the degrees of freedom are
12 + 9 = 21.

Analysis of Variance.—An extension of the t-technique enables us
to explore the reliability of differences between several groups
simultaneously, instead of dealing with only two at a time. For example,
we can test the significance of differences in means or percentages
between sets of pupils taught by several alternative methods, or
classes in several schools, or children classified according to several
grades of parental occupation. Again the groups may be classified
in two or more directions simultaneously, say by teaching method,
type of school, sex and intelligence level. The relative importance of
these influences on average test scores, and their statistical
significance, can thus be assessed. Note that this involves an important
departure from the model of the classical scientific experiment,
where the effects of altering one variable at a time were observed, all
other variables being kept constant. Hence analysis of variance,
which was first developed by Fisher for studying the effects of
different fertilizers or other conditions in agriculture, has become
one of the most powerful and most widely used tools of educational
experimentation. It has also proved of particular value in analysing
examination marks: how far these vary with the examiner, or with
the question or topic set, or with the school from which the
candidates come, and so on.

The detailed techniques are somewhat elaborate for description 
here (cf. Lindquist, 1940), but the following simple example will 
give some indication of their nature. 


Ex. 35.—The writer's men students were divided, according to
their various courses of teacher training, into seven sections, each
numbering about 22 students. The average level of ability in different
sections seemed to differ considerably, and their average percentage
marks on examinations in psychology are listed in the second
column of the table below. The third column shows, however, that
the marks in any one section were very variable, usually ranging
from 45% to 85%.

Section    Mean Psychology Mark    S.D. of Marks
   A              72.89                 8.25
   B              68.69                 8.43
   C              65.59                 7.88
   D              64.41                 7.05
   E              63.95                 8.58
   F              63.52                10.71
   G              63.48                10.71

Thus in order to answer the
question whether there exists any relation between a student's ability
and the section to which he is assigned, we must determine whether
these differences between the means of sections are significant or
whether, in view of the variability of the marks within each section,
such differences might arise through chance fluctuations of sampling.
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“Variance” is simply another term for spread, dispersion, or vari- 
ability (actually the variance of a distribution of measures is the 
square of its S.D.). Analysis of variance enables us to say how far 
these differences between sections can be ascribed to the effect of 
individual differences within sections. The result is expressed in terms 
of the F-ratio, in much the same way as the significance of differences 
between two groups is expressed as 7. (Indeed an F-ratio when only 
two groups are compared is identical with 7°.) In this instance the 
F-ratio is 2-93, and appropriate tables tell us that the probability of 
such differences between seven means arising by chance is quite 
small, approximately 0-01. We may conclude that there is a definite 
tendency for better students to be put into some sections, poorer 


students into others. 
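
The way an F-ratio is built up can be sketched from the table of Ex. 35
alone. The sketch below is only illustrative: it assumes that every
section contains exactly 22 students and that each S.D. was computed
with N in the denominator, neither of which is stated in the text, and
the function name `f_ratio` is hypothetical. Under these assumptions
the result comes out somewhat above the 2.93 quoted; the exact figure
depends on the true section sizes.

```python
# One-way analysis of variance from section means and S.D.s (Ex. 35).
# Assumptions (not given in the text): equal sections of 22 students,
# and S.D.s computed with N rather than N - 1 in the denominator.
means = [72.89, 68.69, 65.59, 64.41, 63.95, 63.52, 63.48]
sds = [8.25, 8.43, 7.88, 7.05, 8.58, 10.71, 10.71]
n = 22


def f_ratio(means, sds, n):
    """Ratio of the between-sections variance to the within-sections variance."""
    k = len(means)
    grand_mean = sum(means) / k  # valid because the sections are taken as equal in size
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ss_within = n * sum(s ** 2 for s in sds)
    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (k * (n - 1))    # k(n - 1) degrees of freedom
    return ms_between / ms_within


F = f_ratio(means, sds, n)
print(round(F, 2))
```

The larger the differences between section means relative to the
spread of marks inside the sections, the larger F becomes.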


S.E. and P.E. of a Correlation Coefficient.¹—The correlation between
two sets of measures is likely to be very unreliable if the number
of persons is small, as was shown in Ex. 28 (p. 76). The S.E. or
P.E. of a correlation coefficient tells us how widely this coefficient
may be expected to deviate from the true value, i.e. the value which
would be obtained from an unlimited number of persons. The
formulae for these errors will be given in Chap. VI.

S.E. and P.E. of an Individual Test Score or Examination Mark.— 
If some pupils or students take a test or examination two or more 
times, or answer two or more supposedly parallel tests in the same 
subject, they will not obtain precisely identical scores on each 
occasion. Hence any score or mark derived from only a single test is 
to some extent unreliable. It may deviate more or less from the score 
which would be obtained with far more thorough testing. In other 
words, the single test only gives us a limited sample of the pupils' or 
students’ capabilities. Here too, then, the conception of the S.E. has 
valuable applications. But the reliability of a score depends, not on 
the number of persons who take the test, nor on the wideness of 
dispersion of scores, but on the thoroughness of the test itself, or the 
number of items of which it is composed, and on the amount of 
variability in pupils’ answers to these items. Elementary methods of 
calculation are given in Chap. VII. A full treatment of the underlying 
theory may be found in Gulliksen’s (1950) book. 

Finally, when a test or examination is employed for predicting a
pupil's success or lack of success during his subsequent educational
or vocational career, it is obvious that the estimate obtained from
the test score will deviate more or less from the truth. The reliability
of such predictions naturally depends mainly upon the adequacy or
validity of the test. It, too, can be expressed in terms of a S.E. or P.E.,
and the appropriate methods will be given below.

¹ It seems to be the convention in psychological literature to employ the S.E.
in the statistical treatment of averages and the like, and the P.E. in the treatment
of correlations and individual scores. There is no logical reason for the
convention, but we shall usually adhere to it in this book.


TESTING DIFFERENCES BETWEEN CATEGORIC 
VARIABLES BY CHI-SQUARED 


In educational and psychological research, there are many types 
of data which we may wish to analyse or compare, but which are not 
amenable to averaging or to the calculation of Standard Errors. For 
example, the colour of people's eyes cannot readily be measured as 
a continuous variable, yet we can classify individuals into a number 
of more or less distinct groups or categories according to their eye- 
colour. Similarly, we may wish to study the numbers of pupils at 
different ages who go to the cinema 0, 1, 2 or more times per week, 
or the numbers who answer Yes, Doubtful or No to an item in a 
questionnaire. Or we may wish to compare how many boys and
girls are given gradings of VG, G, Fair and Poor in an examination.


Ex. 36.—A group of 80 male students stated anonymously that 
their voting in a General Election was: 23 Conservative, 13 Liberal, 
and 44 Labour. We shall take it that, in the country as a whole, the 
proportions had been 45%, 10% and 45%. Are we justified in con- 
cluding that the students were more left-wing than the average, or 
might this difference arise just through chance errors of sampling? 
Our approach is to assume that no real differences exist between our 
group and a group of 80 typical voters—this is termed the ‘null 
hypothesis', and then to analyse the degree of unlikelihood of the 
difference occurring by chance. 

In the following table, F gives the obtained frequencies of students
in each group or ‘cell’. $F_e$ represents the expected frequencies in an
equal number of typical voters (45% of 80 = 36, and so on). We
then calculate Chi-Squared:

$$\chi^2 = \sum \frac{(F - F_e)^2}{F_e}$$

                 F     F_e    F − F_e    (F − F_e)²    (F − F_e)²/F_e
Conservative    23     36      −13          169            4.694
Liberal         13      8       +5           25            3.125
Labour          44     36       +8           64            1.778

                80     80                            χ² =  9.597

Note that the $F - F_e$ column must add up to zero. The Chi-Squared Tables,¹
given in most statistical textbooks, tell us that this value, 9.597, has
a P or probability of close to 0.01, i.e. that it might arise by chance
only about once in a hundred similar samples. Hence there is a
fairly strong presumption that students of the kind we have studied
are less conservative than the general body of voters. Had our sample
been double the size, chi-squared would likewise be doubled, and
the probability would have dropped to less than 1 in 1,000.
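
The arithmetic of Ex. 36 can be checked with a few lines of Python.
This is a minimal sketch, not part of the original method: the variable
names are illustrative, and the probability is obtained from the closed
form exp(−χ²/2), which holds only for 2 degrees of freedom, instead of
from printed tables.

```python
import math

# Ex. 36: obtained votes and the national proportions quoted in the text.
observed = {"Conservative": 23, "Liberal": 13, "Labour": 44}
proportions = {"Conservative": 0.45, "Liberal": 0.10, "Labour": 0.45}

n = sum(observed.values())
expected = {party: n * p for party, p in proportions.items()}

# Chi-squared is the sum of (F - Fe)^2 / Fe over the cells.
chi_squared = sum(
    (observed[party] - expected[party]) ** 2 / expected[party] for party in observed
)

# For 2 degrees of freedom the upper-tail probability is exactly exp(-chi2 / 2).
p_value = math.exp(-chi_squared / 2)

# Doubling every frequency doubles chi-squared, as the text remarks.
p_doubled = math.exp(-(2 * chi_squared) / 2)

print(round(chi_squared, 3), round(p_value, 4), p_doubled < 0.001)
```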


The same procedure is used for comparing two or more groups,
except that we now ask whether our samples show the same
distribution, not whether they conform to some predetermined typical
distribution.


Ex. 37.—Groups of 75 boys and 65 girls gave the following as
their numbers of visits to the cinema per week. Is there a statistically
reliable sex difference?

No. of             F                   F_e           (F − F_e)²/F_e
Visits     Boys  Girls  Total      Boys    Girls      Boys    Girls
  0          5    10     15        8.03     6.97      1.14    1.32
  1         33    32     65       34.82    30.18      0.10    0.11
  2         26    18     44       23.57    20.43      0.25    0.29
  3          7     4     11  \
  4          3     1      4   }    8.57     7.43      0.69    0.79
  5          1     —      1  /

            75    65    140       74.99    65.01      χ² = 4.69
                                                      P  = .20


¹ Such tables show the probability of chi-squared for different ‘degrees of
freedom’. Degrees of freedom here means the number of cells whose F's must
be known in order to complete the distribution, in this case 2. For the total
number of students is fixed, and we need only to know the numbers voting for
two of the parties in order to determine the F in the third cell.

If the null hypothesis were true, we would expect the same
proportion of boys as of girls in each line of the table; i.e. of the 15
children with 0 visits, $\frac{75}{140} \times 15 = 8.03$ should be boys, and
$\frac{65}{140} \times 15 = 6.97$ should be girls. Each entry in the $F_e$ columns is similarly
calculated. Note that these $F_e$'s must add up to the same totals as
the F's both across and down the way. The last three rows are best
combined together, since there is a general rule that no value of $F_e$
in a chi-squared analysis should fall below 5, or preferably 10.
$\frac{(F - F_e)^2}{F_e}$ is calculated as before, but this time for both columns.
The obtained value of chi-squared is not significant statistically²;
our differences between the sexes could occur by chance 1 in 5 times
in other similar samples.
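
The steps of Ex. 37 (combining the sparse rows, finding each expected
frequency from the marginal totals, and summing the contributions) can
be sketched as follows. The names are illustrative, and the probability
uses a closed form that is valid only for 3 degrees of freedom; with
unrounded expected frequencies chi-squared comes to about 4.70.

```python
import math

# Ex. 37: numbers of boys and girls making 0, 1, 2, 3, 4, 5 visits per week.
boys = [5, 33, 26, 7, 3, 1]
girls = [10, 32, 18, 4, 1, 0]

# Combine the last three rows so that no expected frequency falls below 5.
boys = boys[:3] + [sum(boys[3:])]
girls = girls[:3] + [sum(girls[3:])]

n_boys, n_girls = sum(boys), sum(girls)
n = n_boys + n_girls

chi_squared = 0.0
for b, g in zip(boys, girls):
    row_total = b + g
    # Under the null hypothesis each row splits in the ratio 75 : 65.
    fe_boys = n_boys / n * row_total
    fe_girls = n_girls / n * row_total
    chi_squared += (b - fe_boys) ** 2 / fe_boys + (g - fe_girls) ** 2 / fe_girls

# Degrees of freedom: (rows - 1) x (columns - 1) = 3 x 1 = 3.
half = chi_squared / 2
p_value = math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_squared / math.pi) * math.exp(-half)

print(round(chi_squared, 2), round(p_value, 2))
```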


This particular problem might also have been studied by
comparing percentages, using the formula on p. 84. 49.3% of the boys and
35.4% of the girls go twice or more a week. The C.R. works out at
1.68, which has a P value of 0.10. Though this gives a slightly more
favourable result, it too provides no acceptable evidence of a reliable
difference; and the more detailed chi-squared analysis is to be
preferred. Another point brought out by this example is that the F's in
each cell should invariably refer to numbers of separate and
independent individuals or items. A very common error among students is
to say that 26 boys have 52 visits between them, and 18 girls 36
visits, and so on, and then to apply chi-squared to the table of visits.
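
The comparison of percentages can be sketched in the same way. The
formula on p. 84 is not reproduced in this section, so the sketch
assumes the usual S.E. of a difference between two independent
proportions (unpooled form); under that assumption the critical ratio
comes out close to the 1.68 quoted. Variable names are illustrative.

```python
import math

# Ex. 37 again: pupils who go to the cinema twice or more a week.
n_boys, n_girls = 75, 65
boys_twice_or_more = 26 + 7 + 3 + 1   # 37 boys
girls_twice_or_more = 18 + 4 + 1      # 23 girls

p1 = boys_twice_or_more / n_boys
p2 = girls_twice_or_more / n_girls

# Assumed formula: S.E. of the difference between two independent proportions.
se_diff = math.sqrt(p1 * (1 - p1) / n_boys + p2 * (1 - p2) / n_girls)
critical_ratio = (p1 - p2) / se_diff

# Two-tailed probability of a normal deviate at least as large as the C.R.
p_value = math.erfc(critical_ratio / math.sqrt(2))

print(round(100 * p1, 1), round(100 * p2, 1))
```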

Another useful application of chi-squared is in testing whether an
obtained distribution conforms to the normal (or any other
theoretical) shape. We have seen that distributions of small numbers of
measures may depart considerably from normality, and that by
increasing N, the irregularities tend to become ironed out. Here also,
therefore, the result obtained from a limited sample (namely an
irregular distribution) can deviate considerably from the result which
a much larger sample would yield (a regular distribution) purely
through errors of sampling: and the unlikelihood of this deviation
can be calculated by chi-squared.


² Here the degrees of freedom are 3. The totals across and down the way are
fixed, thus we need to know only 3 cell entries to complete the table. In the
following example, there are 8 degrees. For though 11 pairs of categories are
compared, 3 quantities are fixed: the number of cases, the mean score, and the
standard deviation. For further explanation of this rather tricky concept of
degrees of freedom, see Fisher (1930).

Ex. 38.—The following figures are extracted from Ex. 20 (p. 50).
They show in the F column the actual frequencies of scores obtained
by students on the Cattell test, and in the $F_e$ column the frequencies
which would be expected if the distribution of scores was a truly
normal one. The procedure is as before, the smallest frequencies, at
the head and foot of the columns, being grouped together.

   F     F_e    F − F_e    (F − F_e)²/F_e
   8      6       +2          0.6667      (first three classes grouped: F = 1, 1, 6; F_e = 1, 1, 4)
  11      9       +2          0.4444
  22     19       +3          0.4737
  26     30       −4          0.5333
  35     41       −6          0.8780
  47     44       +3          0.2045
  43     40       +3          0.2250
  27     28       −1          0.0356
  18     18        0          0.0000
  10      9       +1          0.1111
   3      6       −3          1.5000      (last classes grouped)

 250    250                   5.0723

Chi-squared has a probability of 0.75. We might expect as great a
deviation, or greater, in three instances out of four by pure chance.
Thus the approximation to normality is quite satisfactory.
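
The probability of 0.75 need not be taken from tables. As a hedged
sketch: when the number of degrees of freedom is even, the upper-tail
probability of chi-squared has a simple closed form, which the
hypothetical helper `upper_tail_even_df` below evaluates for the
grouped frequencies of Ex. 38.

```python
import math

# Ex. 38: obtained and expected frequencies after grouping the extreme classes.
observed = [8, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
expected = [6, 9, 19, 30, 41, 44, 40, 28, 18, 9, 6]

chi_squared = sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))


def upper_tail_even_df(chi2, df):
    """P(chi-squared >= chi2) for an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive even df")
    half = chi2 / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


# 11 pairs of categories less 3 fixed quantities (N, mean, S.D.) leave 8 d.f.
p_value = upper_tail_even_df(chi_squared, 8)

print(round(chi_squared, 3), round(p_value, 2))
```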


CHAPTER VI 
MEASUREMENT OF CORRELATION 


A very wide range of problems in educational and other branches 
of psychology depends on the study of relationships between 
variables. For example, examinations are set at the end of the pri- 
mary school career, partly in order to test the work that the children 
have done, but also partly because achievement in English and arith- 
metic is supposed to be related to, or predictive of, future achieve- 
ment in more advanced work. The main object of applying intelli- 
gence, or other aptitude, tests is to show probable ability in various 
fields which are related to the test performances. In many secondary 
schools the pupils are allotted to different classes for general work, 
for mathematics, and for foreign languages, since it is believed (with 
much justification) that abilities in these three types of study are not 
very closely related, that some may be good in general work, poor at 
mathematics, medium at languages, and so on. 

Several different methods, applicable to different kinds of data,
are available for determining the extent of relationship between two
variables. Almost all of them yield indices ranging from 0 (no
correlation) to +1.0 (perfect correlation). By far the most frequently
used are the product-moment correlation coefficient and the
rank-order correlation coefficient. We will point out later the limitations
of these two methods, and mention briefly some of the alternatives.


Ex. 39.—The following marks were obtained by a class of 36
children on examinations in history and geography. A glance will
show that there is some correspondence; that those who obtain
high marks in one tend to get high marks in the other, and that low
marks in one tend to go with low marks in the other; but that there
are many exceptions to this trend.

Name      X_1        X_2           X_1    X_2       X_1    X_2
        History   Geography
Alan       16        12            10      5         8      6
…          19        15            14     18        20     12
John       17        11             8      6        17     11
…          10        17             6      4        12      8
…          18        14            18     20         7      2
…          18        16            14      8        17     16
…          15         4            12      4        14      8
…          17        13            18     14        14      9
…          12        12            16     10        12      3
…          16         8             6      4        16      1
…          11        10            18     14        20     15
…          14        11            12     11        17      6

The state of affairs may be seen more clearly if the two sets of
measures are plotted against one another. Fig. 21 is called a
scatter-diagram or scattergram. The marks have both been grouped into
classes of 2. History marks ($X_1$) are shown on the vertical axis,
geography marks ($X_2$) on the horizontal axis. Each pupil's position
in the two exams is shown by a tick. For example, Alan's 16 and 12
are represented by a tick opposite the 15-16 mark in history, and
below the 11-12 mark in geography. Four pupils obtained either 17
or 18 in history and either 13 or 14 in geography, hence there are
four ticks in this box or cell.

[Fig. 21.—Scattergram, comparing two sets of school marks. Geography
($X_2$) classes run along the horizontal axis, history ($X_1$) classes
along the vertical axis.]


It will be seen that the majority of ticks tend to be grouped along
the dotted diagonal line, although there are some rather extreme
exceptions; for instance, the pupil with marks of 16 and 1, and the
pupil with marks of 10 and 17. Obviously the closer the
correspondence between the two sets, the more closely would the ticks fall along
the diagonal, and the fewer would be the exceptions whose ticks are
far removed from it. Fig. 26 (p. 133) and Ex. 49 (p. 113) show the
scattergrams for a low correlation of +0.146 and a high correlation
of +0.799 (the numbers in each cell represent the numbers of ticks).
Only if each pupil got marks which exactly corresponded would the
correlation be perfect and the coefficient be +1.0. Note, however, that
it is not necessary for the marks on the two variables to be identical
for a perfect correlation to be obtained, since these marks may be
arranged on different scales. In Fig. 21 the geography marks are on
a different scale from the history marks; 10 on the one corresponds,
or is relatively equivalent, to 14 on the other. Consider another
instance: the weights and heights of a class of children would certainly
yield a very high correlation, although they would be measured on
entirely different scales, pounds and inches. It is the closeness of
agreement of relative scores or measures which governs the size of
the correlation.

Very few correlations in educational or psychological
investigations approximate to +1.0. Usually there is only a general trend for
high scores on one variable to go with high scores on the other, a
trend which admits of many exceptions. Two tests of the same ability,
e.g. two similar arithmetic papers, may perhaps give a coefficient of
+0.90 or over. Tests or exams in different subjects, e.g. arithmetic
and English, or history and geography, are more likely to give
moderate correlation coefficients of +0.40 to +0.70.

Occasionally one may find negative or inverse correspondence,
when high scores on one variable go with low scores on another
variable. For instance, in an ordinary school class intelligence or
scholastic achievement and age are often negatively related, since
the oldest pupils are rather dull, the youngest pupils very bright.
Negative correlations are expressed, just as positive ones, by indices
ranging from 0 to −1.0.

Calculation of Correlations by the Product-moment Method
(sometimes called the Pearson-Bravais Method).—The fundamental
formula for the correlation coefficient, r, is:

$$r = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the two distributions. $x_1$ represents
a score or measure expressed as a deviation from the mean of all the
$X_1$ measures, $x_2$ is a measure expressed as a deviation from the mean
of all the $X_2$ measures. $x_1 x_2$ is the product of a person's two measures,
and $\sum x_1 x_2$ is the sum of all these products.

Since the means are seldom likely to be whole numbers, the
computation of $\sum x_1 x_2$ would involve very troublesome arithmetic.
Hence, just as in calculating S.D.s, it is usual to resort to the device
of the arbitrary mean. Each of the quantities in the above formula
then requires a correction. If $x_1$ is a deviation from an arbitrary
mean, $\sigma_1$ is no longer $\sqrt{\frac{\sum x_1^2}{N}}$ but

$$\sigma_1 = \sqrt{\frac{\sum x_1^2}{N} - \left(\frac{\sum x_1}{N}\right)^2}$$

(cf. p. 47). A corresponding alteration is made in $\sigma_2$. $\frac{\sum x_1 x_2}{N}$ becomes

$$\frac{\sum x_1 x_2}{N} - \frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$$

Hence the full formula is:

$$r = \frac{\dfrac{\sum x_1 x_2}{N} - \dfrac{\sum x_1}{N} \times \dfrac{\sum x_2}{N}}{\sqrt{\dfrac{\sum x_1^2}{N} - \left(\dfrac{\sum x_1}{N}\right)^2} \; \sqrt{\dfrac{\sum x_2^2}{N} - \left(\dfrac{\sum x_2}{N}\right)^2}}$$

This same formula applies if the original scores are used, i.e. if $X_1$
and $X_2$ are substituted for $x_1$ and $x_2$. The only difference is that the
arbitrary mean is now zero, instead of being some whole number
chosen close to the true mean.

Statistical textbooks often provide modifications of this formula,
or ‘patent’ correlation techniques which are claimed to be simpler.
The present writer prefers the direct method given here, both because
it makes use of concepts such as the true and arbitrary mean, $\sigma$, and
the correction required when $\sigma$ is calculated from an arbitrary mean
(concepts with which the student should already be familiar) and
because errors in computation are fairly readily detected. Moreover,
a coefficient accurate to three places of decimals can be obtained by
this method if the multiplication and division, and the finding of
squares and square roots, are done with a 10-inch slide-rule and/or
four-figure logarithm tables.


Ex. 40.—The formula will be applied first to the original table of
marks, given in Ex. 39, and later to the same marks grouped into
classes (as in Fig. 21). This first method should generally be employed
only when N is small, say less than 50, and when a high degree of
accuracy is needed.¹ In the following table, columns $X_1$ and $X_2$ list
the original marks. As arbitrary means, 14 and 10 are selected.
Columns $x_1$ and $x_2$ list the deviations of $X_1$ and $X_2$ from these A's.
They are summed to give $\sum x_1$ and $\sum x_2$.

¹ When an electric calculating machine is available, this method is generally
used with original scores (not an arbitrary mean), even if N is very large.

$$\frac{\sum x_1}{N} = \frac{+5}{36} \qquad \frac{\sum x_2}{N} = \frac{-2}{36}$$

Therefore the true means are 14.139 and 9.945. Later we shall need
$\left(\frac{\sum x_1}{N}\right)^2$, $\left(\frac{\sum x_2}{N}\right)^2$, and $\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$.
These come to +0.019, +0.003, and −0.008 respectively. Note
that three places of decimals are sufficient for these corrections.
Special attention should be paid to the sign (+ or −) of $\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$.

The two columns on p. 96, $x_1^2$ and $x_2^2$, give the deviations squared.
These are summed and averaged.

$$\frac{\sum x_1^2}{N} = \frac{549}{36} = 15.250 \qquad \frac{\sum x_2^2}{N} = \frac{836}{36} = 23.222$$

Hence

$$\sigma_1 = \sqrt{\frac{\sum x_1^2}{N} - \left(\frac{\sum x_1}{N}\right)^2} = \sqrt{15.250 - 0.019} = 3.903$$

$$\sigma_2 = \sqrt{23.222 - 0.003} = 4.819$$

The last pair of columns gives the products of the deviations; the
+ and − products are separated merely for convenience in summing.
Special attention to the signs is needed. When $x_1$ and $x_2$ are both +
or both − the product is plus, when one is + the other − the
product is minus. Notice how the exceptional cases, such as the pupil
with 10 and 17, make large negative contributions to $\sum x_1 x_2$, and so
ultimately reduce the size of r. The pupils who get very high marks
on both variables, or very low marks on both, such as those with 6
and 4 or with 18 and 20, make large positive contributions to $\sum x_1 x_2$
and help to raise r.

$$\frac{\sum x_1 x_2}{N} = \frac{+392}{36} = 10.889$$

The correction, $\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$, is subtracted from this, but as its sign
was −, it is in this case added.

$$\frac{\sum x_1 x_2}{N} - \frac{\sum x_1}{N} \times \frac{\sum x_2}{N} = 10.889 - (-0.008) = 10.897$$

All the corrections indicated in the second formula have been
made, hence we can finally apply the first, or fundamental,
correlation formula:

$$r = \frac{\sum x_1 x_2}{N} \times \frac{1}{\sigma_1 \sigma_2} = \frac{10.897}{3.903 \times 4.819} = +0.579$$

Thus the coefficient shows moderate correspondence between the
two sets of marks, as we guessed from looking at the scattergram.
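
The whole of Ex. 40 can be followed through in a short Python sketch.
The 36 mark pairs below are taken from the table of Ex. 39; where the
printed table is hard to read they have been set to the values that
agree with the sums quoted in the working (Σx₁ = +5, Σx₂ = −2,
Σx₁² = 549, Σx₂² = 836, Σx₁x₂ = +392), so the listing should be treated
as a reconstruction and not as the author's table. Variable names are
illustrative.

```python
import math

# (history, geography) marks for the 36 children of Ex. 39 (reconstructed listing).
marks = [
    (16, 12), (19, 15), (17, 11), (10, 17), (18, 14), (18, 16),
    (15, 4), (17, 13), (12, 12), (16, 8), (11, 10), (14, 11),
    (10, 5), (14, 18), (8, 6), (6, 4), (18, 20), (14, 8),
    (12, 4), (18, 14), (16, 10), (6, 4), (18, 14), (12, 11),
    (8, 6), (20, 12), (17, 11), (12, 8), (7, 2), (17, 16),
    (14, 8), (14, 9), (12, 3), (16, 1), (20, 15), (17, 6),
]
A1, A2 = 14, 10          # arbitrary means chosen in the text
N = len(marks)

# Deviations from the arbitrary means.
x1 = [h - A1 for h, _ in marks]
x2 = [g - A2 for _, g in marks]

sum_x1, sum_x2 = sum(x1), sum(x2)
sum_x1_sq = sum(d * d for d in x1)
sum_x2_sq = sum(d * d for d in x2)
sum_x1x2 = sum(a * b for a, b in zip(x1, x2))

# Corrected S.D.s and corrected mean product, as in the full formula.
sigma1 = math.sqrt(sum_x1_sq / N - (sum_x1 / N) ** 2)
sigma2 = math.sqrt(sum_x2_sq / N - (sum_x2 / N) ** 2)
mean_product = sum_x1x2 / N - (sum_x1 / N) * (sum_x2 / N)

r = mean_product / (sigma1 * sigma2)
print(round(sigma1, 3), round(sigma2, 3), round(r, 3))
```

Choosing different arbitrary means changes every intermediate sum but
leaves r unaltered, which is a convenient check on the corrections.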


Product-moment Correlation between Tabulated Measures.—When
the measures are grouped into classes, the procedure is essentially
identical, except that the computations are, of course, performed for
all the measures in any one class at one time. We are no longer
concerned with the deviations of single measures from their mean, and
their products, but with deviations of classes from the mean class,
and their products. At no time is it necessary to translate the results
back into terms of the original measures, by multiplying by c (the
size of the class).


Ex. 41.—The two sets of measures are first plotted in a
scattergram, which should contain, if possible, between 11 and 19 rows and
a similar number of columns. There is no need to have identical
numbers of rows and columns. In Fig. 21 (p. 93) the data are, as a
matter of fact, rather too coarsely grouped, since there are only 8
rows and 10 columns.

[Scattergram of Fig. 21 with the class numberings added: $x_2$ = −4, −3,
−2, −1, 0, +1, +2, +3, +4, +5 above the columns, and the row
frequencies $F_1$ at the right.]

An appropriate arbitrary mean is now chosen for the $x_1$ classes
and for the $x_2$ classes, and the classes above and below these means
are numbered +1, +2, ... −1, −2, ... etc. These numberings
are written at the left-hand side of the rows, and above the tops of
the columns. Standard deviations are now calculated in the usual
way.


x_1 and x_2    F_1    F_2    F_1 x_1    F_2 x_2    F_1 x_1²    F_2 x_2²
    +5          —      1        —         +5          —           25
    +4          —      2        —          8          —           32
    +3          3      4       +9         12         27           36
    +2         10      4       20          8         40           16
    +1          5      7        5          7          5            7
     0          5      3      +34        +40          0            0
    −1          6      4       −6         −4          6            4
    −2          2      4        4          8          8           16
    −3          3      5        9         15         27           45
    −4          2      2        8          8         32           32

               36     36      −27        −35        145          213

$$\frac{\sum F_1 x_1}{N} = \frac{+7}{36} = +0.1944 \qquad \left(\frac{\sum F_1 x_1}{N}\right)^2 = 0.0378$$

$$\frac{\sum F_2 x_2}{N} = \frac{+5}{36} = +0.1389 \qquad \left(\frac{\sum F_2 x_2}{N}\right)^2 = 0.0193$$

The correction $\frac{\sum F_1 x_1}{N} \times \frac{\sum F_2 x_2}{N}$, which will be needed later in the
numerator of the correlation formula, is:

$$+0.1944 \times +0.1389 = +0.027$$

$$\frac{\sum F_1 x_1^2}{N} = \frac{145}{36} = 4.028 \qquad \sigma_1 = \sqrt{4.028 - 0.038} = 1.997$$

$$\frac{\sum F_2 x_2^2}{N} = \frac{213}{36} = 5.917 \qquad \sigma_2 = \sqrt{5.917 - 0.019} = 2.429$$

Note that these S.D.s are approximately one-half those obtained
in the previous computation of r (namely 3.903 and 4.819), the
reason being that we have grouped the original measures into classes
of 2. They are not exactly half because the grouping has produced
some slight distortion of the distribution.

Now to obtain the products, the frequency in each cell must be
multiplied by its $x_1$ value and by its $x_2$ value. For instance, the entry
2 in cell ($x_1$ = +3) ($x_2$ = +3) contributes 2 × 3 × 3 = +18 to
$\sum F x_1 x_2$. The entry 1 in cell ($x_1$ = −2) ($x_2$ = +4) contributes
1 × 4 × −2 = −8.
It is unnecessary, however, to consider each cell in turn. Instead
we take each possible value of $x_1 x_2$ in turn, as shown in the
accompanying table. All the cells for which either $x_1$ = 0 or $x_2$ = 0
contribute zero to $\sum F x_1 x_2$. It will be seen that the total frequencies in such
cells is 7. The total frequency in the cells where $x_1 x_2$ = +1 is 2,
hence +2 is entered in the last column.

Value of     Cells corresponding         Frequencies       Total     F x_1 x_2
x_1 x_2      to this value               in these cells    F's
    0        x_1 = +1   x_2 =  0               1
             x_1 =  0   x_2 =  0               1
             x_1 = −1   x_2 =  0               1
             x_1 =  0   x_2 = −1               2
             x_1 =  0   x_2 = +1               1
             x_1 =  0   x_2 = +4               1             7           0
   +1        x_1 = +1   x_2 = +1               1
             x_1 = −1   x_2 = −1               1             2          +2
   +2        x_1 = +2   x_2 = +1               2             2          +4
   +3        x_1 = +3   x_2 = +1               1
             x_1 = −1   x_2 = −3               2             3          +9
   +4        x_1 = +2   x_2 = +2               4
             x_1 = −2   x_2 = −2               1             5         +20
   +6        x_1 = +2   x_2 = +3               2
             x_1 = −3   x_2 = −2               2             4         +24
   +9        x_1 = +3   x_2 = +3               2             2         +18
  +10        x_1 = +2   x_2 = +5               1             1         +10
  +12        x_1 = −3   x_2 = −4               1
             x_1 = −4   x_2 = −3               2             3         +36
                                                                      +123
   −1        x_1 = +1   x_2 = −1               1
             x_1 = −1   x_2 = +1               2             3          −3
   −3        x_1 = +1   x_2 = −3               1             1          −3
   −4        x_1 = +2   x_2 = −2               1
             x_1 = +1   x_2 = −4               1             2          −8
   −8        x_1 = −2   x_2 = +4               1             1          −8
                                                                       −22

                                                    ΣF x_1 x_2 =      +101

As soon as the student becomes familiar with the method he will
find it unnecessary to write out the second, third, and fourth
columns of such a table. Only the first, and the last two are needed.
The simplest plan is to draw a cross through each of the cells for
which $x_1 x_2$ = 0, and to count up the frequencies in these cells while
doing so, and then enter the total in a table thus:

x_1 x_2      F      F x_1 x_2
    0        7          0
   +1        2         +2

Next cross out the cells where $x_1 x_2$ = +1, and enter in the table,
and so on. At the end, sum the F column, in order to make sure that
none of the entries in the cells have been omitted. The last column
is summed to give $\sum F x_1 x_2$, and the rest of the working is as before.


$$\frac{\sum F x_1 x_2}{N} = \frac{101}{36} = 2.806$$

The correction, to be subtracted, is +0.027.

$$2.806 - 0.027 = 2.779 \qquad r = \frac{2.779}{1.997 \times 2.429} = +0.573$$

It will be seen that in spite of the distortion due to grouping the
scores into a rather coarse scattergram, the result is nearly identical
with that obtained from the ungrouped measures, namely +0.579.
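
The grouped computation of Ex. 41 needs only the marginal frequencies
and the table of products. The following sketch, with illustrative
names, repeats the working from those figures and arrives at the same
coefficient.

```python
import math

N = 36
# Class deviations and marginal frequencies from the table of Ex. 41.
deviations = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]
F1 = [0, 0, 3, 10, 5, 5, 6, 2, 3, 2]   # history classes
F2 = [1, 2, 4, 4, 7, 3, 4, 4, 5, 2]    # geography classes

# Total frequency of the cells sharing each value of the product x1 * x2.
product_freq = {0: 7, 1: 2, 2: 2, 3: 3, 4: 5, 6: 4, 9: 2, 10: 1, 12: 3,
                -1: 3, -3: 1, -4: 2, -8: 1}


def mean_and_variance(freqs):
    """Mean deviation from the arbitrary class, and the corrected variance."""
    mean = sum(f * d for f, d in zip(freqs, deviations)) / N
    variance = sum(f * d * d for f, d in zip(freqs, deviations)) / N - mean ** 2
    return mean, variance


mean1, var1 = mean_and_variance(F1)
mean2, var2 = mean_and_variance(F2)

sum_products = sum(value * freq for value, freq in product_freq.items())
corrected_mean_product = sum_products / N - mean1 * mean2

r = corrected_mean_product / math.sqrt(var1 * var2)
print(sum_products, round(math.sqrt(var1), 3), round(math.sqrt(var2), 3), round(r, 3))
```

Summing the frequencies in `product_freq` and checking that they come
to N mirrors the advice in the text to sum the F column.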


There are, of course, numerous possibilities of error in working out 
product-moment correlations. All additions should be checked up
and down; multiplying, dividing, squaring, etc., may be done with 
logarithm tables and confirmed with a slide-rule. But a little practice 
will enable the student to predict r fairly closely from consideration 
of the scattergram, and so to judge whether or not the value he 
reaches is likely to be correct. 

An Alternative Method of Calculating Correlation from the
Scattergram.—We will describe one of the various alternative methods,
which may be found simpler to use than the above, direct, method.
It eliminates the collection of the contributions to $\sum F x_1 x_2$ from each
cell, and substitutes the calculation of two additional S.D.s, which
we shall call $\sigma_3$ and $\sigma_4$. Actually, one of these is sufficient, but the
other acts as a valuable check. The formula for r is:

$$r = \frac{\sigma_3^2 - \sigma_1^2 - \sigma_2^2}{2\sigma_1\sigma_2} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_4^2}{2\sigma_1\sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the $F_1$ and $F_2$ distributions as before.
The other two distributions are obtained by counting the frequencies
along the diagonals.


Ex. 42.—Applying this to the scattergram in Ex. 41: for $\sigma_3$ we
start from the top right-hand corner and proceed to the bottom
left-hand, counting all the entries along the diagonal at right angles
to this line as we go. On the diagonal running from ($x_1$ = +3)
($x_2$ = +4) to ($x_1$ = +2) ($x_2$ = +5) there is one entry. The next
diagonal runs from ($x_1$ = +3) ($x_2$ = +3) to ($x_1$ = +1)
($x_2$ = +5); this takes in 2 entries. The next also takes in 2, the
next 1 + 4 + 1 = 6, and so on. The complete distribution, $F_3$, is
given in the following table. The corresponding distribution $F_4$,
along the diagonal from top left to bottom right, is also listed.
Arbitrary means are chosen and the S.D.s worked out in the usual
way.

 F_3    F_4      d      F_3 d    F_4 d    F_3 d²    F_4 d²
  1      —      +7        7        —        49        —
  2      —      +6       12        —        72        —
  2      1      +5       10        5        50        25
  6      2      +4       24        8        96        32
  2      0      +3        6        0        18         0
  2      4      +2        4        8         8        16
  2      6      +1        2        6         2         6
  5     10       0      +65      +27         0         0
  3      8      −1        3        8         3         8
  2      2      −2        4        4         8         8
  1      1      −3        3        3         9         9
  3      1      −4       12        4        48        16
  2      0      −5       10        0        50         0
  0      1      −6        0        6         0        36
  3      —      −7       21        —       147        —

 36     36              −53      −25       560       156

$$\frac{\sum F_3 d}{N} = \frac{65 - 53}{36} = +0.333 \qquad \frac{\sum F_3 d^2}{N} = \frac{560}{36} = 15.556$$

$$\sigma_3^2 = 15.556 - (0.333)^2 = 15.445$$

Similarly $\sigma_4^2 = 4.333 - (0.056)^2 = 4.330$.
We already know $\sigma_1^2$ and $\sigma_2^2$ to be 3.990 and 5.898. Hence:

$$r = \frac{15.445 - 3.990 - 5.898}{2 \times 1.997 \times 2.429} = +0.573$$

The version using $\sigma_4$ instead of $\sigma_3$ gives the same result, which is
also identical with that obtained on p. 99.
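
Both versions of the diagonal formula can be verified from the $F_3$ and
$F_4$ columns. The sketch below is illustrative only; it takes
$\sigma_1^2$ and $\sigma_2^2$ as the rounded values 3.990 and 5.898
quoted above, and the helper name `variance` is hypothetical.

```python
import math

N = 36
d = [7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7]
F3 = [1, 2, 2, 6, 2, 2, 2, 5, 3, 2, 1, 3, 2, 0, 3]    # one set of diagonals
F4 = [0, 0, 1, 2, 0, 4, 6, 10, 8, 2, 1, 1, 0, 1, 0]   # the other set

var1, var2 = 3.990, 5.898   # sigma_1^2 and sigma_2^2 from Ex. 41


def variance(freqs):
    """Variance of a diagonal distribution, corrected for the arbitrary mean."""
    mean = sum(f * x for f, x in zip(freqs, d)) / N
    return sum(f * x * x for f, x in zip(freqs, d)) / N - mean ** 2


var3, var4 = variance(F3), variance(F4)
denominator = 2 * math.sqrt(var1 * var2)

r_from_sigma3 = (var3 - var1 - var2) / denominator
r_from_sigma4 = (var1 + var2 - var4) / denominator
print(round(var3, 3), round(var4, 3), round(r_from_sigma3, 3), round(r_from_sigma4, 3))
```

The two results agree because the variance of a sum of two measures
exceeds, and the variance of their difference falls short of,
$\sigma_1^2 + \sigma_2^2$ by the same amount, twice the mean product of
deviations.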
Rank-order Correlation.—A different method for determining
correlations is available for data arranged in order, for example the
orders of merit of a school class on two examinations. When the
number of cases is greater than about 30, the figures involved become
troublesomely big, but with a small number the method is much
easier to use than the product-moment method. Often, therefore, sets
of scores are turned into rank orders and the correlation between
these orders is computed. The formula is as follows:

$$r = 1 - \frac{6 \sum d^2}{N(N^2 - 1)}$$

where d is the difference between each person's two ranks. Such a
correlation is often indicated by the Greek letter ρ (rho), instead of
by r.

Ex. 43.—The second and third columns of the following table
give the marks of eleven pupils on tests $X_1$ and $X_2$. The next two
columns show the ranks, pupil A being 6th on $X_1$, 4th on $X_2$, and
so on. The d column gives the differences in rank, regardless of sign.
These are squared and summed.

Pupil    X_1    X_2       Ranks         d        d²
                        X_1    X_2
  A      65     75       6      4       2        4
  B      85     66       1      7       6       36
  C      64     58       7      9       2        4
  D      78     82       3      1       2        4
  E      60     58       8      9       1        1
  F      70     80       5      2½      2½       6.25
  G      83     80       2      2½      ½        0.25
  H      55     71      10      5       5       25
  I      50     58      11      9       2        4
  J      72     55       4     11       7       49
  K      59     68       9      6       3        9

                                               142.5

Substituting in the formula:

$$r = 1 - \frac{6 \times 142.5}{11 \times 120} = 1 - 0.648 = +0.352$$

Clearly the bigger the differences in rank positions, the bigger will
$\sum d^2$ be, and the lower the value of r. If $6 \sum d^2$ is greater than $N(N^2 - 1)$
the correlation will be negative.
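
The ranking and the substitution of Ex. 43 can be sketched as follows.
The helper `rank_scores` is hypothetical; it gives rank 1 to the
highest mark and lets tied pupils share the mean of the positions they
occupy, which is the treatment of ties used in the table.

```python
# Ex. 43: marks of eleven pupils on tests X1 and X2.
marks = {
    "A": (65, 75), "B": (85, 66), "C": (64, 58), "D": (78, 82),
    "E": (60, 58), "F": (70, 80), "G": (83, 80), "H": (55, 71),
    "I": (50, 58), "J": (72, 55), "K": (59, 68),
}


def rank_scores(scores):
    """Rank 1 = highest score; tied scores share the mean of their positions."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 + (ordered.count(s) - 1) / 2 for s in scores]


x1 = [pair[0] for pair in marks.values()]
x2 = [pair[1] for pair in marks.values()]
ranks1, ranks2 = rank_scores(x1), rank_scores(x2)

# Sum of squared rank differences, then the rank-order formula.
sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))
n = len(marks)
rho = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(ranks2, sum_d_squared, round(rho, 3))
```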

Now there is no difficulty in turning scores into ranks when, as in
$X_1$ in the above example, every person gets a different score. But
when two (or more) persons tie with the same score, they then divide
the two (or more) ranks between them, as in $X_2$. Thus pupils F and
G must share 2nd and 3rd place, and are both called 2½, whilst the
next highest pupil, A, is called 4. Again, C, E, and I share the 8th,
9th, and 10th positions, and are all called 9, whilst J, the next highest,
is 11. Such sharing is undesirable in that it distorts, and generally
decreases, the size of the correlation. It follows, therefore, that the
rank-order method should not be applied when there are a great
many ties.


OTHER METHODS OF MEASURING CORRELATION 

The nature and the main uses of other methods will be indicated
briefly, but they will not be described in detail. Full accounts of them
may be found in more advanced textbooks.

Spearman's Foot-rule Method.—This, like the rank method, is
applicable to data arranged in rank order. It is perhaps even simpler
than the ordinary rank method, but the coefficients which it yields
range only from +1.0 to −0.5, not from +1.0 to −1.0, and they
often diverge rather widely from those obtained by the first two
methods. A new and easy method, which possesses certain important
statistical advantages over the rank method, has been described by
Kendall (1948).

Correlation Ratio.—This is denoted by the symbol η (eta), instead
of by r. We saw in Fig. 21 (p. 93), that the ticks in a scattergram tend
to be grouped along a straight diagonal. This is referred to as ‘linear
regression.’ When the regression is non-linear, that is, when the ticks
tend to be grouped round a curve, a product-moment correlation
fails to show the closeness of relationship, and the correlation ratio
is used instead. For example, when tests of motor perseveration are
compared with assessments of desirable character traits, it is
sometimes found that both the persons with high perseveration scores
and those with low scores are weak in character, whereas those with
medium scores get the best character assessments. This type of
relationship is illustrated in the accompanying scattergram, Fig. 22.

[Fig. 22.—Scattergram of non-linear correlation. Axes: character
ratings and motor perseveration scores.]

Coefficient of Contingency.—Ordinary correlation methods cannot
be applied to categoric variables such as those described on p. 88.
We can nevertheless analyse the relationship between, say, hair and
eye colour, or between religious affiliation and socio-economic grade,
etc., by an extension of the chi-squared technique. The individuals
are first classified into convenient numbers of categories for each
variable, and a cross-classification table, similar to a scattergram, is
drawn up. We then assume that there is no relationship between the
variables and compute, on the basis of this hypothesis, how many
individuals would be expected to fall within each of the cells. If the
actual frequencies diverge from these expected ones to an extent
which cannot possibly be ascribed to chance errors of sampling, this
provides evidence of a relationship between the variables.


Ex. 44.—Records of the times spent on television viewing over
one week were kept by 293 grammar school pupils, who were also
classified according to the number of years their families had had
television sets. Does the following table indicate that pupils from
families which have owned sets for a long time view less than
relatively recent owners?

                          Length of Ownership
                    Less than    1 to 2     3 years
                     1 year      years     and over    Total
Hours of    8+         14          31         41         86
viewing     4–7+       14          49         66        129
in week     0–3+        5          21         52         78

            Total      33         101        159        293

Note that there is no need to have the same number of categories
for both variables. Quite broad ones are generally adopted, so that
no value of $F_e$ falls much below 10.

Now if there were no relation between the two variables, we
might expect the 86 pupils with 8+ hours viewing to be distributed,
as regards their ownership periods, in the proportions 33 : 101 : 159.
That is, the $F_e$'s for the first line of the table should be:

$$86 \times \frac{33}{293},\quad 86 \times \frac{101}{293},\quad \text{and}\quad 86 \times \frac{159}{293} \;=\; 9.7,\ 29.6,\ \text{and}\ 46.7.$$

The other entries in the $F_e$ table are similarly calculated, and the
$F_o - F_e$ table does suggest that a certain excess of short-owners
view for more hours than long-owners.

TABLE OF $F_e$'s.
  9.7    29.6    46.7  |   86
 14.5    44.5    70.0  |  129
  8.8    26.9    42.3  |   78
 33     101     159    |  293

TABLE OF ($F_o - F_e$).
 +4.3    +1.4    −5.7
 −0.5    +4.5    −4.0
 −3.8    −5.9    +9.7

Summing the entries of the table of $(F_o - F_e)^2 / F_e$ gives
chi-squared = 8.55, and as there are 4 degrees of freedom (cf. footnote,
p. 89), this has a probability of 0.07. Such odds are not high, but they
are suggestive of a slight negative correlation between amount of
viewing and length of ownership.
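The working of Ex. 44 can be retraced with a short sketch. The function name `chi_squared` is illustrative only, and the calculation uses unrounded expected frequencies, so it gives about 8.51 rather than the 8.55 quoted in the text, which appears to have been worked from rounded entries.

```python
# Observed frequencies from Ex. 44 (rows: 8+, 4-7+, 0-3+ hours of viewing;
# columns: ownership under 1 year, 1 to 2 years, 3 years and over).
observed = [
    [14, 31, 41],
    [14, 49, 66],
    [5, 21, 52],
]


def chi_squared(table):
    """Return (chi-squared, expected frequencies, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency of a cell = row total x column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, expected, df


chi2, expected, df = chi_squared(observed)
print([round(fe, 1) for fe in expected[0]])  # first row of the F_e table
print(round(chi2, 2), df)
```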


This chi-squared can be converted into what is called the contingency
coefficient, C. A further modification yields a coefficient
which is more closely comparable to a correlation, $r_c$:

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}, \qquad r_c = \frac{1}{k_1 k_2}\sqrt{\frac{\chi^2 - \text{d.f.}}{N + \chi^2 - \text{d.f.}}}$$

Here, d.f. refers to degrees of freedom, and k provides a correction
for small numbers of categories.

No. of Categories        k
        2              0.798
        3              0.872
        4              0.923
        5              0.949
        6              0.964
        8              0.979
       10              0.986

In our example, the pupils are grouped into 3 categories on each
variable, and $k_1 = k_2 = 0.872$. Thus:

$$r_c = \frac{1}{0.872^2}\sqrt{\frac{8.55 - 4}{293 + 8.55 - 4}} = \frac{1}{0.872^2}\sqrt{\frac{4.55}{297.55}} = 0.163.$$
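A minimal sketch of the conversion follows. It assumes the reading $C = \sqrt{\chi^2/(N + \chi^2)}$ and $r_c = \sqrt{(\chi^2 - \text{d.f.})/(N + \chi^2 - \text{d.f.})}\,/\,(k_1 k_2)$, which reproduces both the 0.163 of this example and the 0.598 quoted in Ex. 45; the function name and the guard against a chi-squared smaller than its degrees of freedom are additions for illustration.

```python
import math

# Correction factors k from the table above, keyed by number of categories.
K = {2: 0.798, 3: 0.872, 4: 0.923, 5: 0.949, 6: 0.964, 8: 0.979, 10: 0.986}


def contingency(chi2, n, rows, cols):
    """Return (C, r_c) for a rows x cols contingency table of n cases."""
    df = (rows - 1) * (cols - 1)
    c = math.sqrt(chi2 / (n + chi2))
    # Assumption: a chi-squared below its d.f. is treated as zero association.
    excess = max(chi2 - df, 0.0)
    r_c = math.sqrt(excess / (n + excess)) / (K[rows] * K[cols])
    return c, r_c


c, r_c = contingency(8.55, 293, 3, 3)
print(round(c, 3), round(r_c, 3))
```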


Considerable caution is essential in interpreting contingency,
since strictly it is a measure of divergence rather than of relationship.
Suppose, for instance, that our original table of $F_o$'s had been:

  41     14     31  |   86
  66     14     49  |  129
  52      5     21  |   78

 159     33    101  |  293

C and $r_c$ would be the same as before, although the relationship is
quite different; for now it is the middle-ownership group which
views most, and the short-ownership group which views least. Thus
contingency coefficients never have a negative sign even though, as
in our example, they happen to express a negative correlation.


Two by Two Tables.—Ex. 45.—Taking the data from Ex. 39,
suppose that we do not know the pupils’ actual marks but only the
categories ‘top half’ and ‘bottom half’¹ on the two examinations.
The following table may be drawn up:

                                    GEOGRAPHY
                           Top Half        Bottom Half
                           11 or over      10 or under
HISTORY   Top Half
          15 or over
          Bottom Half
          14 or under

¹ There is no necessity for the proportions to be equal, as they are in the
present instance. But coefficients are more reliable the nearer the split is to the
median.

There are several ways of determining a correlation for such data.
Two that have often been applied are Yule's coefficient of association
(Q), and his coefficient of colligation (ω).

$$Q = \frac{ad - bc}{ad + bc}$$

where a, b, c, and d are the frequencies in the four cells. Substituting,
we find Q = +0.742. Neither of these coefficients is at all closely
comparable to a product-moment correlation, and contingency is
therefore somewhat preferable. For the same table, $\chi^2 = 7.11$,
hence C = 0.406, and $r_c$ = 0.598. The latter figure compares quite
well with the original product-moment r of +0.579.
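Yule's two coefficients are easily computed from the four cell frequencies. In the sketch below the frequencies are hypothetical (the cells of the Ex. 45 table are not used), and the colligation formula is Yule's standard one, which the text does not itself set out.

```python
import math


def yule_q(a, b, c, d):
    """Yule's coefficient of association, Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def yule_colligation(a, b, c, d):
    """Yule's coefficient of colligation, built on the square roots of ad and bc."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


# Hypothetical frequencies for a two by two table of 36 cases:
# a and d are the agreeing cells (top/top, bottom/bottom).
a, b, c, d = 12, 5, 5, 14
q = yule_q(a, b, c, d)
omega = yule_colligation(a, b, c, d)
print(round(q, 3), round(omega, 3))
```

For the same table the colligation coefficient is always nearer zero than Q, which is one reason the two cannot be read on the same scale as a product-moment r.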

A still more suitable technique is known as tetrachoric correlation
($r_t$). This cannot easily be calculated, but it can be read off very
quickly from specially prepared graphs (cf. Chesire et al., 1933).
Note, however, that $r_t$ can be used only when the variables to be
compared (although both expressed in the form of two categories)
can be regarded as normally distributed. It therefore applies to our
examination marks, but could not be used for sex, or eye-colour,
and the like. In the present instance, $r_t$ = +0.642. The discrepancy
between this and our r of +0.579 is probably due to the small
number of cases and the irregularities of their distributions.

Biserial r.—When one of the variables to be compared has been 
accurately measured and has yielded a normal distribution, but the 
other (though believed to be normally distributed) can only be 
expressed in the form of two categories, biserial r may be used. For 
instance, if we wish to compare intelligence-test scores with success 
or failure in a scholarship examination, we may not know the 
examination marks, but we do know the intelligence scores of 
scholars and non-scholars. 


Ex. 46.—Assume that the pupils of Ex. 39 are divided into two
groups according to their history marks, the top 17 and the bottom
19. The mean geography marks of these groups are 12.23 and 7.89
respectively.

$$r_{bis} = \frac{M_1 - M_2}{\sigma} \times \frac{pq}{z} \qquad \text{or} \qquad \frac{M_1 - M}{\sigma} \times \frac{p}{z}$$

$M_1$ and $M_2$ are the means of the contrasted groups. In the alternative
formula, M is the mean of the total group. σ is the S.D. of the
total group, namely 4.819. p and q are the proportions falling into
the groups, 17/36 and 19/36. z is the ordinate of the normal curve
corresponding to p (obtainable from statistical tables); in this instance
z = 0.397.

$$r_{bis} = \frac{12.23 - 7.89}{4.819} \times \frac{0.2490}{0.397} = +0.563.$$

Though this result agrees well with our product-moment r, $r_{bis}$
tends to be somewhat unreliable and easily distorted by irregularities
in the distribution.
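The calculation of Ex. 46 may be sketched as follows, with the ordinate z obtained from the normal curve directly rather than from tables. The function name is illustrative; because p, q and z are not rounded, the result comes to about 0.564 rather than exactly 0.563.

```python
from statistics import NormalDist


def biserial_r(mean_upper, mean_lower, sd_total, n_upper, n_lower):
    """Biserial r from the two group means and the S.D. of the total group."""
    n = n_upper + n_lower
    p, q = n_upper / n, n_lower / n
    unit_normal = NormalDist()
    # Ordinate (height) of the unit normal curve at the point of division.
    z = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean_upper - mean_lower) / sd_total * (p * q) / z


r_bis = biserial_r(12.23, 7.89, 4.819, 17, 19)
print(round(r_bis, 3))
```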


THE PROBABLE ERRORS OF CORRELATION COEFFICIENTS
A correlation coefficient is, like any other numerical psychological
result, liable to vary on account of sampling errors, especially when
the sample of persons, whose scores are inter-correlated, is of small
size. The formula for the P.E. of a product-moment coefficient is:

$$\text{P.E.}_r = \frac{0.6745\,(1 - r^2)}{\sqrt{N}}$$

Applying this to our r of +0.579, where N = 36, we get P.E. =
±0.075. It is usual to write the P.E. after the coefficient, thus: r =
+0.579 ± 0.075. The P.E. of a rank order coefficient is slightly
greater than that of a product-moment one, and is usually calculated
by the formula:

$$\text{P.E.}_\rho = \frac{0.7063\,(1 - \rho^2)}{\sqrt{N}}$$

For the determination of the P.E.s or S.E.s of the other coefficients
mentioned above, the reader should consult more advanced textbooks,
such as Kelley (1924) or Guilford (1936).
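As a check on the arithmetic, the two probable-error formulae can be written as one small function; the name `probable_error` and the `constant` parameter are illustrative choices, not the author's notation.

```python
import math


def probable_error(coefficient, n, constant=0.6745):
    """P.E. of a correlation: constant * (1 - r^2) / sqrt(N).

    Use constant=0.6745 for a product-moment r and 0.7063 for a
    rank order coefficient.
    """
    return constant * (1 - coefficient ** 2) / math.sqrt(n)


pe_r = probable_error(0.579, 36)
pe_rho = probable_error(0.579, 36, constant=0.7063)
print(round(pe_r, 3), round(pe_rho, 3))
```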

The connotation of the P.E. of r or ρ is similar to that of an
average or difference (cf. Chap. V). Thus +0.579 ± 0.075 signifies
that if the same history and geography papers were set to much
larger numbers of similar children, and the marks inter-correlated,
the true value of r so obtained would be as likely as not to lie
between +0.654 and +0.504 (i.e. r + 1 × P.E. and r − 1 × P.E.),
but that there is a 1 in 20 probability that it might deviate as widely
as +0.804 or +0.354 or more (3 × P.E.).

Just as a difference between averages should be two or more times
its S.E. to be accepted as statistically significant, so a correlation
should be 3 times its P.E. before it is regarded as showing a real
relationship between the inter-correlated variables. A coefficient of,
say, +0.30 ± 0.10 is on the borderline of significance. For if the
true value of r was zero, the scores of a limited sample of persons
might yield +0.30 once in 44 times.

In our previous illustration +0.579 is much greater than 3 × P.E.
Yet, as we have seen, this value is far from trustworthy, the reason
being the small size of N. Most investigators consider a correlation
calculated from less than 25 cases to be almost worthless because
of its low reliability, and they prefer to have at least a hundred cases
before they base any important conclusions on the results of an
experiment involving correlations.

When numbers are small, it is better to calculate t instead of the
P.E.

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

For r = 0.30 and N = 36, t = 1.834. Thus this coefficient barely
reaches the 0.05 confidence level.

Fisher’s z Method.—The P.E. method of indicating the reliability
of r is inaccurate, not only when N is small, but also when r is large,
exceeding say 0.40. Fisher (1930) presents a better method which is
generally adopted nowadays. It consists in transforming r into a new
quantity called z, whose true S.E. is easily calculated.¹ Applying
this method to our r = +0.579, we find that the limits within which
true r is as likely as not to lie are almost identical with the limits
derived, above, from $\text{P.E.}_r$, but that the 0.05 limits should be +0.762
and +0.310 instead of +0.804 and +0.354.
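The limits quoted can be reproduced with a few lines, taking $z = \tfrac{1}{2}\log_e\frac{1 + r}{1 - r}$ (the inverse hyperbolic tangent of r) and S.E. of $z = 1/\sqrt{N - 3}$. The function name and the `multiple` parameter (1.96 for the 0.05 limits, 0.6745 for the even-chance limits) are illustrative.

```python
import math


def fisher_limits(r, n, multiple):
    """Limits for true r at `multiple` standard errors of z either side."""
    z = math.atanh(r)                # Fisher's z transformation of r
    se_z = 1 / math.sqrt(n - 3)      # S.E. of z
    return math.tanh(z - multiple * se_z), math.tanh(z + multiple * se_z)


low_05, high_05 = fisher_limits(0.579, 36, 1.96)
low_even, high_even = fisher_limits(0.579, 36, 0.6745)
print(round(low_05, 3), round(high_05, 3))
print(round(low_even, 3), round(high_even, 3))
```

Note that the limits are not symmetrical about r: the lower limit lies further from +0.579 than the upper one, which the P.E. method cannot show.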

Combining and Comparing Correlations.—The z method also provides
a useful way of combining two or more correlations.

Ex. 47.—Suppose that another class of 43 pupils had taken the
history and geography papers and yielded a coefficient of +0.450
± 0.082, what is the best estimate of the true correlation? The
procedure is to determine $z_1$ and $z_2$ for the two r's, and then to
compute the weighted average,

$$\bar{z} = \frac{z_1(N_1 - 3) + z_2(N_2 - 3)}{N_1 + N_2 - 6}$$

Translating back from the average z gives us our answer, r =
+0.512 ± 0.057.
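The weighted averaging of Ex. 47 can be sketched for any number of groups. The function name is illustrative, and z is computed exactly rather than read from Fisher's tables, so the last decimal place may differ slightly from the printed answer.

```python
import math


def combine_correlations(groups):
    """Best estimate of r from (r, N) pairs, weighting each z by N - 3."""
    weighted_sum = sum(math.atanh(r) * (n - 3) for r, n in groups)
    total_weight = sum(n - 3 for _, n in groups)
    return math.tanh(weighted_sum / total_weight)


combined_r = combine_correlations([(0.579, 36), (0.450, 43)])
print(round(combined_r, 3))
```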


¹ $z = \tfrac{1}{2}\left[\log_e(1 + r) - \log_e(1 - r)\right]$. The S.E. of $z = \dfrac{1}{\sqrt{N - 3}}$. Fisher
provides tables of z for various values of r.

It is often required to find whether a correlation in one group is
appreciably different from that in another group, or whether the
difference can be ascribed to errors of sampling. When the numbers
of cases are large the stock method may be used,

$$\sigma_d = \sqrt{\text{S.E.}_{r_1}^{\,2} + \text{S.E.}_{r_2}^{\,2}}.$$

Ex. 48.—What is the significance of the difference between
+0.579 ± 0.075 and +0.450 ± 0.082? The S.E.s of these coefficients
are the P.E.s ÷ 0.6745, i.e. 0.1108 and 0.1216. $\sigma_d$ = 0.1645. The
difference is 0.579 − 0.450 = 0.129, which is not as large as its own
$\sigma_d$. A difference as large as or larger than this would occur 43% of
times as a mere matter of chance.

As the numbers are small, Fisher's z method should preferably be
used. This consists in computing $z_1$, $z_2$, and their S.E.s, and applying
the stock formula to these S.E.s. The difference in z's is +0.1763
in the above example. $\text{S.E.}_{z_1}$ = 0.1741, $\text{S.E.}_{z_2}$ = 0.1581.
$\sigma_d = \sqrt{0.1741^2 + 0.1581^2} = 0.2352$. Here the difference is 0.75 times its
$\sigma_d$, and it might occur 45% of times as a mere matter of chance.
Note that these methods do not apply to the difference between two
correlations in the same group of people (cf. Lindquist, 1940).
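A sketch of the z comparison in Ex. 48 follows; it assumes two independent groups, as the text requires, and takes the chance probability from both tails of the normal curve. The function name is illustrative.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Return (difference in z, its sigma_d, two-tailed chance probability)."""
    difference = math.atanh(r1) - math.atanh(r2)
    # S.E. of each z is 1 / sqrt(N - 3); the squares are summed.
    sigma_d = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    ratio = abs(difference) / sigma_d
    probability = 2 * (1 - NormalDist().cdf(ratio))
    return difference, sigma_d, probability


difference, sigma_d, probability = compare_correlations(0.579, 36, 0.450, 43)
print(round(difference, 4), round(sigma_d, 4), round(probability, 2))
```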


CHAPTER VII


INTERPRETATION AND APPLICATIONS OF
CORRELATIONS


A CORRELATION coefficient is never an end in itself, but always a
means towards the proper interpretation of numerical data. By
studying the correlations between tests of various abilities it is
possible to analyse scientifically the nature of such abilities. This is
one important application, which we shall consider in Chap. VIII.
But the primary value of a correlation is that it enables us to make
predictions. If abilities $X_1$ and $X_2$ are known to be inter-correlated,
then we can use measures of $X_1$ to predict scores on $X_2$ and vice
versa. For instance a pupil's probable scholastic success may be
forecast from his performance on an intelligence test. This cannot,
of course, be done with complete accuracy, but the correlation
coefficient will also tell us how accurate is the forecast, or what are its
limits of error.


Ex. 49.—The following scattergram shows the Intelligence Quotients
of 440 children and adults on the Stanford-Binet scale, and
their I.Q.s on the Vocabulary test from this scale. (The group or
category 140 includes all testees with I.Q.s around 140, i.e. ranging
from 135½ to 145½; similarly with the other groups.) Here the
correlation is quite a high one, namely +0.799 ± 0.012. Hence we
would be justified in testing people only with the Vocabulary test,
and predicting from their results what would be their Binet I.Q.s.
Let us see how to make such predictions, and how good they
would be.


[Scattergram of Binet I.Q. ($X_1$, rows) against Vocabulary I.Q. ($X_2$,
columns). Only the marginal averages are legible:]

$X_1$ (Binet I.Q.)         140    130    120    110    100     90     80     70     60
Average $X_2$             130.1  121.5  115.6  107.6   99.3   91.2   82.4   74.0   65.7

$X_2$ (Vocabulary I.Q.)     60     70     80     90    100    110    120    130    140    150
Average $X_1$              69.2   75.0   88.0   91.9   99.9  109.9  118.5  121.6  128.3  138.0

Suppose a child to obtain a Vocabulary I.Q. of 120, his Binet I.Q.
may, according to the scattergram, lie as high as the 140 class, i.e.
up to 145, or as low as the 100 class, i.e. down to 96. But presumably
its most probable value will be the average of the column:

$X_1$      F
140      3
130     12
120     16
110      9
100      8

This works out at 118.5. Similarly for a child whose Vocabulary I.Q.
is 70, the most likely value of his Binet I.Q. is 75.0, though it may
lie anywhere between 56 and 95.
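The column average is simply a mean weighted by the frequencies. A brief sketch for the Vocabulary I.Q. 120 column, using the class mid-points and frequencies listed above (the variable names are illustrative):

```python
# Binet I.Q. class mid-points and their frequencies in the column of
# testees whose Vocabulary I.Q. falls in the 120 class.
column = {140: 3, 130: 12, 120: 16, 110: 9, 100: 8}

cases = sum(column.values())
column_average = sum(iq * f for iq, f in column.items()) / cases
print(cases, round(column_average, 1))
```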

Along the bottom of the scattergram are given the averages of the
columns, i.e. of the $X_1$ measures which most probably correspond
to the $X_2$ measures at the heads of the columns. Fig. 23 is a graph
of these figures. It will be seen that the points on the graph lie
approximately on a straight line which runs through $X_1$ = 100,
$X_2$ = 100. That they do not all lie exactly on the line is due merely
to irregularities in the data. For though the numbers involved in our
example are large, they are still not large enough to yield really
smooth normal distributions, or regularly spaced averages.

Now as we approach the extreme values of $X_2$, the corresponding
values of $X_1$ tend to lag behind; 118.5 is smaller than 120, 138 is
much smaller than 150. Similarly with values below 100, 75 is not so
low as 70, 69 is much less low than 60. This is an important and
characteristic feature of all correlations. It is known as the regression
of $X_1$ towards the mean. The line drawn through the points on the
graph is called a regression line, or a graph showing the regression
of $X_1$ on $X_2$.

[Fig. 23.—Graph showing the most probable values of $X_1$ for testees who
obtain each value of $X_2$. Horizontal axis: $X_2$, from 50 to 150.]

Suppose now that the correlation is a perfect one. The scores on
$X_1$ and $X_2$ would be identical, and the slope of the regression line
would be 45°. Next suppose that there is no correlation at all
between $X_1$ and $X_2$. It would then obviously be impossible to predict
scores on one variable from those on the other. If we took successive
columns of the scattergram and averaged them, we should find that
all these means lay close to the mean of $X_1$. Whatever the value of
the Vocabulary I.Q., the average Binet I.Q. would be round about
100. In other words, the regression line would be horizontal. From
these two hypothetical cases we can now see that the nearer the slope
of the regression line is to 45°, the higher the correlation; the nearer
it is to horizontal, the lower the correlation. Furthermore, when the
correlation is perfect, there is no regression towards the mean, since
each $X_1$ value is as big as each $X_2$ value. But when the correlation is
zero, the regression is complete, since whatever the value of $X_2$ the
value of $X_1$ is always the mean. Thus the greater the regression, the
nearer the correlation approaches zero.

It has been assumed in the previous paragraph that the $X_1$ and $X_2$
measures are expressed in equivalent units. If the dispersions of the
two distributions are very different, then the slope of the regression
line corresponding to perfect correlation would not, of course, be
45°. However, as shown in Chap. IV, measures can be turned into
equivalent units if they are expressed as deviations from their means,
and divided by their standard deviations, i.e. as $x_1/\sigma_1$ and $x_2/\sigma_2$.
If this is done the slope of the regression line is identical with r.
When the slope is 45°,

$$\frac{x_1/\sigma_1}{x_2/\sigma_2} = 1.0,$$

that is perfect correlation. When the slope is horizontal,

$$\frac{x_1/\sigma_1}{x_2/\sigma_2} = 0.0,$$

that is zero correlation.

The slope of the regression line plotted in Fig. 23 is 0.80. $\sigma_1$ and
$\sigma_2$ are 17.49 and 17.84 (i.e. the Vocabulary I.Q.s are slightly more
spread out than the Binet I.Q.s). Hence

$$r = \frac{0.80 \times 17.84}{17.49} = 0.815.$$

This is very close to the figure already quoted for the correlation, as
determined by the product-moment method. As a matter of historical
fact, r was originally determined by plotting the regression line
and measuring its slope, before statisticians invented the product-moment
method. The latter method is always employed now because,
as we have seen, the regression line obtained by averaging the
columns is somewhat inexact, even when N is large. It is more useful
to reverse the historical procedure, to calculate $\sigma_1$, $\sigma_2$, and r, and
from them to determine the regression line. If

$$\frac{x_1/\sigma_1}{x_2/\sigma_2} = r, \quad \text{then} \quad x_1 = r\,x_2\,\frac{\sigma_1}{\sigma_2}.$$

This is the algebraic equation of the regression line, and
it can be used for predicting any individual's most probable $x_1$ score
from his $x_2$ score. Alternatively, if we wish to predict in terms of
original measures, instead of measures expressed as deviations from
their means, the formula becomes:

$$(X_1 - M_1) = r\,(X_2 - M_2)\,\frac{\sigma_1}{\sigma_2}.$$


Ex. 50.—What is the most probable Binet I.Q. corresponding to
a Vocabulary I.Q. of 140? According to the averages listed at the
foot of the scattergram, the answer is 128.3. But this is based on only
6 cases and is therefore very unreliable. Applying the formula:

$$X_1 - 100 = 0.799\,(140 - 100) \times \frac{17.49}{17.84}$$

$X_1$ = 131.3 is the correct answer.

Regression of $X_2$ on $X_1$.—So far we have dealt only with the
prediction of $X_1$, knowing $X_2$. The opposite is equally possible. On the
right-hand side of the scattergram in Ex. 49 are listed the average
values of $X_2$ which constitute the best estimates of the Vocabulary
I.Q.s of persons with various Binet I.Q.s. Here, too, there is regression
towards the mean. The probable Vocabulary I.Q. corresponding to a
Binet I.Q. of 120 is 115.6, and for a Binet I.Q. of 70 it is 74.0. A
second graph may therefore be drawn, showing the regression of
$X_2$ on $X_1$. Fig. 24 shows the two lines diagrammatically. If the
correlation was perfect the two lines would coincide, both having a
slope of 45° (provided, always, that $X_1$ and $X_2$ are measured in
equivalent units). If the correlation was zero, the second line would
be vertical, as in Fig. 25, since the only possible estimate of
Vocabulary I.Q. from Binet I.Q. would be 100. With an intermediate
correlation, such as our +0.799, the second line makes the same slope
with the vertical as does the first regression line with the horizontal;
and the two cross at the means of $X_1$ and $X_2$. Thus the algebraic
equation of the second line, which is also the formula for
predicting $X_2$ from $X_1$, will be:

$$x_2 = r\,x_1\,\frac{\sigma_2}{\sigma_1},$$

or, in terms of scores,

$$(X_2 - M_2) = r\,(X_1 - M_1)\,\frac{\sigma_2}{\sigma_1}.$$

[Fig. 24.—Regression lines when r = +0.799, showing the regression of
$X_2$ on $X_1$ and of $X_1$ on $X_2$.]

[Fig. 25.—Regression lines when r = 0.0, showing the regression of
$X_2$ on $X_1$ and of $X_1$ on $X_2$.]

Ex. 51.—What is the probable Vocabulary I.Q. of a testee with
Binet I.Q. 80?

$$X_2 = 100 + 0.799\,(80 - 100) \times \frac{17.84}{17.49} = 83.7.$$

The estimate based on the scattergram was 82.4, which agrees
quite closely with our answer.
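Both regression equations have the same form, so a single function serves for Exs. 50 and 51. In this sketch the function and constant names are illustrative, and the means are taken as 100 for both tests, as in the worked examples.

```python
def predict(x, mean_x, sd_x, mean_y, sd_y, r):
    """Most probable Y for a given X: (Y - M_y) = r (X - M_x) sd_y / sd_x."""
    return mean_y + r * (x - mean_x) * sd_y / sd_x


BINET = (100.0, 17.49)   # (mean, S.D.) of X1
VOCAB = (100.0, 17.84)   # (mean, S.D.) of X2
R = 0.799

binet_from_vocab = predict(140, *VOCAB, *BINET, R)   # Ex. 50
vocab_from_binet = predict(80, *BINET, *VOCAB, R)    # Ex. 51
print(round(binet_from_vocab, 1), round(vocab_from_binet, 1))
```

Predicting back from the answer to Ex. 50 does not return the starting value of 140, since each prediction regresses towards the mean; this is why two distinct regression lines are needed.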


P.E. of Estimated Scores.—The regression equations will tell us
the most probable estimate of $X_1$ from $X_2$, or of $X_2$ from $X_1$. But it
is obvious from the scattergram that the actual $X_1$ may fall within
quite a wide range on either side of the estimated $X_1$, and that
similarly an estimated $X_2$ is only the average of a wide range of
possible $X_2$'s. The ranges may be determined from the formulae:

$$\text{P.E.}_{\text{estim.}X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2}$$

$$\text{P.E.}_{\text{estim.}X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2}$$

Applied to Exs. 50 and 51: $\text{P.E.}_{\text{estim.}X_1} = 0.6745 \times 17.49\sqrt{1 - 0.799^2}$
= 7.1, $\text{P.E.}_{\text{estim.}X_2}$ = 7.3.

The P.E. or S.E. of an estimated score has a similar meaning to
those already considered, as may be shown by the following example.
Consider a very large group of persons whose Binet I.Q.s are 110,
and whose estimated Vocabulary I.Q.s are 108.1 ± 7.3. Then one
half of them should have Vocabulary I.Q.s lying between 115.4 and
100.8, and all but 1 in 100 should lie within a range of 2.58 × S.E.,
i.e. about 4 × P.E., or between 136 and 80.

Actually we already know the Vocabulary I.Q.s of 71 persons
with Binet I.Q.s round 110. This is not a very large group, but it
should serve to test out our prediction. The scattergram is too coarse,
but looking back to the original scores it appears that 37 of them,
or 52%, had Vocabulary I.Q.s between 115 and 101 (inclusive), and
that the two extreme Vocabulary I.Q.s were 138 and 80. This
agreement is quite close.

It is important to note that the P.E. of an estimated score decreases
as r increases. If the correlation were perfect there would be
no error at all in estimating $X_1$ from $X_2$ or $X_2$ from $X_1$. The lower
the correlation, the less accurate will be the predictions.


Ex. 52.—As an additional illustration, take the results from the
36 history and geography papers. Here N is far too small for accurate
regression lines to be obtained by averaging rows or columns. But
by the use of the formulae given above, probable geography marks
can be estimated from history marks, or vice versa. For instance:
what geography mark ($X_2$) corresponds to a history mark ($X_1$) of 7,
and what is its P.E.?

$M_1$ = 14.139          $M_2$ = 9.945
$\sigma_1$ = 3.903           $\sigma_2$ = 4.819
r = +0.579

$$(X_2 - 9.945) = 0.579\,(7 - 14.139) \times \frac{4.819}{3.903}$$

$$X_2 = 9.945 - 5.103 = 4.84$$

$$\text{P.E.}_{\text{estim.}X_2} = 0.6745 \times 4.819\sqrt{1 - 0.579^2} = 2.65$$

Therefore $X_2$ = 4.84 ± 2.65.
Here the correlation is only moderate, hence the estimate is poor
in reliability. All we can claim is that a pupil with a history mark of
7 would be very unlikely to get a geography mark as high as 15
($X_2$ + 4 × P.E.), and would most probably get round about 2 to 7.
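The estimate and its probable error in Ex. 52 can be obtained together. This is a minimal sketch with an illustrative function name, using the means, S.D.s and r given in the example.

```python
import math


def estimate_with_pe(x, mean_x, sd_x, mean_y, sd_y, r):
    """Return the most probable Y for a given X, and the P.E. of that estimate."""
    estimate = mean_y + r * (x - mean_x) * sd_y / sd_x
    pe = 0.6745 * sd_y * math.sqrt(1 - r ** 2)
    return estimate, pe


# History (X1): mean 14.139, S.D. 3.903. Geography (X2): mean 9.945, S.D. 4.819.
geography, pe = estimate_with_pe(7, 14.139, 3.903, 9.945, 4.819, 0.579)
print(round(geography, 2), round(pe, 2))
print(round(geography - pe, 1), round(geography + pe, 1))  # even-chance range
```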


INTERPRETATION OF CORRELATIONS 

In the light of the above discussion it is possible to give a much
more precise meaning to the conception of correlation than heretofore.
A correlation does not mean, as is sometimes supposed, the
percentage agreement between the correlated variables. In the
scattergram on p. 113, for example, only 37½% of the testees get the
same I.Q., or rather the same class of I.Q., on Binet and Vocabulary,
although the correlation is +0.799. It is possible to interpret a
coefficient in terms of the number of factors or elements which are
common to the two variables, which therefore cause them to correlate
with one another, but there is little practical point in so doing.
What a correlation does indicate is the amount of reduction of error
in predicting scores on one variable from scores on the other variable.
When r is zero, the error is at its maximum, namely 0.6745σ.
Predictions based on $X_2$ might fall anywhere within the whole range of
$X_1$ scores, and would be no more accurate than drawing $X_1$ scores
out of a hat. With a correlation of 1.0 the error is reduced to zero.
The extent of reduction of error for intermediate r’s is shown in the
following table. The second column lists values of $100\,(1 - \sqrt{1 - r^2})$,
and so represents the reduction in percentage terms. These figures
are sometimes referred to as the forecasting efficiencies of the r's. It
will be seen that the forecasting efficiency only becomes high enough
to be of great practical value when r is in the neighbourhood of 0.8
to 0.9 or more. Great caution should be exercised therefore before
predicting, say, success at a job from aptitude tests, since these
usually give correlations of only +0.4 to +0.6 with what we wish
to predict. Such deductions are only 8% to 20% more accurate than
pure chance guesses.


   r        Reduction of error, or
            forecasting efficiency %

 0.00              0
 0.10              0.5
 0.20              2.0
 0.30              4.6
 0.40              8.4
 0.50             13.4
 0.60             20.0
 0.70             28.6
 0.80             40.0
 0.90             56.4
 0.95             68.8
 1.00            100.0
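The table can be regenerated from the expression $100\,(1 - \sqrt{1 - r^2})$. The sketch below uses an illustrative function name; at r = 0.40 the unrounded value is about 8.35, which the table gives as 8.4.

```python
import math


def forecasting_efficiency(r):
    """Percentage reduction of error in prediction for a correlation r."""
    return 100 * (1 - math.sqrt(1 - r ** 2))


for r in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0):
    print(f"{r:4.2f}  {forecasting_efficiency(r):6.1f}")
```

The steep rise at the upper end shows why the text treats correlations below about 0.8 as of limited use for individual prediction.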


An alternative method of interpretation is suggested by McClelland 
(1942). Suppose we are selecting the top 20 of a group of 100 can- 
didates, either for secondary schooling or for a job, on the basis of 
their intelligence test or selection battery scores. We know that the 
test gives a correlation of r with later success. What proportion of 
candidates is correctly allocated, or how many do we wrongly select 
and wrongly reject? The position for various levels of r is shown in 


the following tables. 


                  r = 1.00                    r = 0.00                    r = 0.85
            Success-  Unsuc-            Success-  Unsuc-            Success-  Unsuc-
            ful later cessful  Total    ful later cessful  Total    ful later cessful  Total
Pass test      20        0      20         4        16      20        14         6      20
Fail test       0       80      80        16        64      80         6        74      80
Total          20       80     100        20        80     100        20        80     100


In the first table all the best 20 on the test turn out well and the
bottom 80 badly; correlation is perfect. In the second, r is zero and
our selection is no better than picking names out of a hat, since as
large a proportion of the 20 who pass do well later as of the 80 who
fail the test (namely 20% of each). Nevertheless, 64 and 4 are
correctly allocated, and the proportion of errors is 32%. The third
table represents the typical situation in secondary school
selection, where the battery of intelligence and attainments tests
correlates about 0.85 with later school work. The number of correct
allocations is now 88, and of errors 12%. (Note, however, that nearly
one-third of the 20 Passes who actually get into the grammar school
gain unsatisfactory marks later.) Thus we have reduced errors by

$$\frac{32 - 12}{32} = 62\tfrac{1}{2}\%.$$

Here are the corresponding reductions for other levels of r.

[Table: r against reduction of errors in prediction, %. The entries are
illegible.]
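The error percentages and the reduction of errors follow directly from the three selection tables given earlier. A sketch, with illustrative names, counting as errors the candidates who pass but are unsuccessful and those who fail but would have been successful:

```python
# Each table: [[pass & successful, pass & unsuccessful],
#              [fail & successful, fail & unsuccessful]]
tables = {
    1.00: [[20, 0], [0, 80]],
    0.00: [[4, 16], [16, 64]],
    0.85: [[14, 6], [6, 74]],
}


def error_percentage(table):
    """Percentage of candidates wrongly selected or wrongly rejected."""
    total = sum(sum(row) for row in table)
    return 100 * (table[0][1] + table[1][0]) / total


chance_errors = error_percentage(tables[0.00])
errors_at_085 = error_percentage(tables[0.85])
reduction = 100 * (chance_errors - errors_at_085) / chance_errors
print(chance_errors, errors_at_085, reduction)
```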


The Influence upon Correlations of Irrelevant Common Factors.—
A positive correlation which is statistically significant (i.e. 3 or
preferably 4½ times its P.E.) always indicates some kind of causal
connection, or some common factor or factors running through the
two variables, even when the connection is not strong enough for us
to be able to forecast one variable from the other. Considerable care
is needed, however, in interpreting such a connection, for its real 
nature may be quite unsuspected, or the common factors may be of 
an entirely irrelevant kind. Suppose, for example, that we correlated 
the size of boots worn by all the children in a school with the speed 
of their handwriting, we should undoubtedly obtain a moderately 
high coefficient. Yet it would be absurd to say that big boots cause 
quick handwriting, or vice versa. The real reason for the correlation 
is an irrelevant common factor, in this case age. The older children 
on the whole write more quickly and have larger feet than the 
younger ones. If we took children all of the same age and compared 
boots with handwriting, we should be likely to find that the
correlation had fallen to zero.

Consider another, rather more complex, illustration. A positive
correlation exists between backwardness in school work and un- 
employment among the children's parents. Idealists may at once 
jump to the conclusion that unemployment brings about malnutri- 
tion and anxiety among the children, and that these make their work 
fall off. But here also there is an important common element which 
goes some way to explain the connection. Unemployed parents are 
known to be on the whole less intelligent than employed ones, and 
unintelligent parents tend not only to have unintelligent children, 
but also to give them less educational encouragement than do in- 
telligent parents. These factors—low intelligence and generally un- 
favourable environment—are far more important causes of back- 
wardness than is malnutrition. We need to take two groups of chil- 
dren, one with parents employed, one unemployed, groups whose 
average intelligence level is identical, and then study their school 
work to see whether the former do better than the latter. Only when
we have ‘held the intelligence factor constant’ in this example, or 
‘held the age factor constant’ in the previous example, can we 
legitimately deduce some causal connection. 

Partial Correlation.—The above type of problem or difficulty in
interpretation is constantly occurring in psychology and education.
Sometimes it can be met by applying what is known as the partial
correlation technique. For this it is necessary to obtain measures of
the suspected irrelevant factor, which we shall call $X_3$, and to work
out the correlations between $X_1$ and $X_3$ ($r_{13}$), and between $X_2$ and
$X_3$ ($r_{23}$), as well as between $X_1$ and $X_2$ ($r_{12}$). We may then ‘hold $X_3$
constant’ or ‘partial it out’ by the following formula, in which $r_{12.3}$
means the correlation of $X_1$ with $X_2$ when $X_3$ is eliminated:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

Ex. 53.—Suppose the correlation between boots and handwriting 
speed to be + 0:45, that between boots and age + 0-70, and between 
handwriting and age + 0:60. Then the correlation between boots 
and handwriting when the influence of age is removed is: 


1 Thouless (19392) has pointed out that the partial correlation formula can 
only legitimately be used if X; can be measured without error. It is thus most 
appropriate for holding age constant, but not for, say, intelligence; since 
measurements of the latter always involve some error. However, Thouless pro- 
vides an alternative formula which may be used when the reliability of X; is 


known. 
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0:45 — 0-70 x 0-60 
М1 — 070*4/1 — 0-602 

The Effect of Homogeneity and Heterogeneity upon Correlations.— 
We have just seen that when pupils are heterogeneous as regards age, 
the correlation between a pair of their characteristics may be 
spuriously raised. The same is true of other types of heterogeneity. 
Suppose we calculate the correlation between two tests in an ordinary 
school class, and then eliminate from consideration most of the 
pupils of medium ability on both tests, and re-calculate the correla- 
tion, the second figure will certainly be larger than the first. The 
distributions will not, of course, be normal, the diversity or hetero- 
geneity of the pupils will be unduly large and this will increase the 
size of r. In ordinary practice the reverse, namely undue homogeneity, 
is more likely to occur without the investigator realizing it. Thus the 
correlations between intelligence tests and school work in an ordinary 
school class are usually lower than they would be in an unselected 
group of children of the same age. For such a school class is more 
homogeneous as regards intelligence and school work than is the 
population as a whole. The dullest children of that age are likely to 
be at a special school, and many of the brightest ones may be at 
private schools. Take an extreme case: if we selected a group of 
children all of precisely the same intelligence, any correlation be- 
tween intelligence and work would disappear. One important reason 
why intelligence tests appear to be of much less value for predicting 
scholastic aptitude among secondary grammar than among primary 
pupils, and to be poorer still among university students, is that 
secondary pupils are more homogeneous or more highly selected 
than primary. Most of the children of I.Q. less than 110 are weeded 
out before the grammar school is reached, and university entrance re- 
quirements remove many more from the middle and lower end of the 
scale. Thus correlations necessarily sink as we pass from unselected 
children to the primary school, from the primary to the grammar, 
and from the grammar to the university level. The typical correlation
of +0.85 which we have quoted for selection tests at 11+ with
later school work drops to about +0.45 within a grammar school
group.
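
The effect of selection on the size of r can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: scores are drawn from a normal distribution, the population correlation is set at 0.85, and the cut-off of I.Q. 110 is applied to the intelligence variable alone. The function names and parameters are hypothetical and do not come from the text, and the restricted figure it yields need not match the +0.45 quoted above.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_selection(rho=0.85, cutoff_iq=110.0, n=20000, seed=1):
    """Return (r in the whole group, r among those at or above cutoff_iq).

    Assumption: I.Q. is normal with mean 100 and S.D. 15, and school work
    is a standard score correlating rho with I.Q. in the whole group.
    """
    rng = random.Random(seed)
    iq, work = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        iq.append(100 + 15 * z1)
        work.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    # Keep only the pupils who pass the selection cut-off.
    kept = [(i, w) for i, w in zip(iq, work) if i >= cutoff_iq]
    r_all = pearson_r(iq, work)
    r_selected = pearson_r([i for i, _ in kept], [w for _, w in kept])
    return r_all, r_selected


if __name__ == "__main__":
    r_all, r_selected = simulate_selection()
    print(f"unselected group: r = {r_all:+.2f}")
    print(f"selected group:   r = {r_selected:+.2f}")
```

The selected group is more homogeneous in intelligence, so the same underlying relationship yields a distinctly smaller coefficient.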

It is not easy to decide just what degree of heterogeneity should be 
regarded as normal and reasonable, nor to specify the extent to 
which any given group of persons exceed or fall short of this degree 
of heterogeneity. Hence the correction of correlations either for
undue homogeneity or excessive heterogeneity is complicated. The
reader may be referred to discussions by Thomson (1939) and 
Thorndike (1949). 

The main conclusion to be drawn from the above sections is that
the investigator should always scrutinize his correlational data very
carefully, first in order to see whether any irrelevant common factors
may be increasing spuriously the size of his coefficients, secondly to
judge whether or not the diversity of scores in the group tested is
unusually wide or restricted. He should realize that the absolute size
of a correlation means very little, since this size may be so much
altered by uncontrolled common factors or heterogeneity. His
interpretation of the amount of relationship between the correlated
variables should be made relative to the particular data with which
he is working.

In contrast to the absolute size, the predictive value or forecasting
efficiency of a correlation is not affected by heterogeneity. For the
latter, as we have seen, depends on $\sigma\sqrt{1 - r^2}$. Hence an increase in
r on account of excessive heterogeneity is offset by the increase in $\sigma$,
and the P.E. remains the same. It should be noted further that forecasting
efficiency is not in the least upset by irrelevant common
factors. True, it might be foolish to predict speed of handwriting from 
the size of boots, yet the correlation obtained between these variables 
would provide a valid means of making such predictions. 


COMBINING TWO OR MORE TESTS 

The educational statistician is frequently faced with the problem
of combining the results of several examinations and tests, and using
the total scores for the selection of the best pupils. We have already
dealt rather fully with the point that the distributions to be combined
should be expressed in equivalent units, i.e. that they should possess 
the same means and S.D.s, except when it is desired to give more 
weight to one set of marks than to another, in which case the former 
should have the larger S.D., not necessarily the larger mean. Here 
we must draw attention to an important fact which markers seldom 
realize, namely that the S.D. or dispersion of combined variables is 
always less than that of the separate variables by an amount depend- 
ing on the lowness of correlation between these variables. This fact 
is a natural corollary of the facts of regression outlined above. 
Suppose, for example, that total marks are based on the average
of six examinations, each with a mean mark of 60, a range of 30 to
90, and a S.D. of 10. If the average correlation between the six
variables is +0.70, the final distribution will probably have a S.D.
of 8.66 and the average totals will range from 34 to 86. But if the
average inter-correlation is only +0.30, the final distribution will
have a S.D. of 6.45, and the range will drop from 30-90 to 41-79. It
is easy to see why this must be so. When the average inter-correlation 
is moderate or low, no pupil or student will get very high marks, or 
very low ones, on all the examinations. Hence the highest and lowest 
average marks will be distinctly nearer to the mean than the highest 
and lowest marks on each separate examination. The same phenom- 
enon occurs when marks on different questions in a single exam- 
ination are totalled. For instance, if five questions are marked, each 
with a range of 5-20 out of 20, the examination totals will probably 
range from about 40-85, and not from 25-100. 

A chief examiner, head teacher, or other marking authority should 
make allowance for these facts. If he wishes the distribution of final 
totals to show a certain proportion of very high (over 80%) marks,
and a certain proportion of low (under 40%) marks, then he must
either direct the markers of the component examinations to employ 
extra-wide distributions (i.e. to award much larger proportions of 
marks over 80% and under 40%), or else he must re-scale the final 
marks so as to spread them out again into a distribution with the 
required dispersion. The formula connecting $\sigma$, the S.D. of the component
distributions, and $\sigma_{av}$, the S.D. of these distributions averaged, is

$$\sigma_{av} = \sigma\sqrt{\frac{1 + (n - 1)R}{n}}$$

where R is the average correlation between all the sets of marks, and n the number of sets.¹
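
The shrinkage of the S.D. when marks are averaged can be checked numerically. The following is a minimal sketch, assuming every component has the same S.D. and that R is the average inter-correlation; it uses the usual relation for the S.D. of an average of n such marks, S.D.(average) = S.D. × √((1 + (n − 1)R) / n). The function names are hypothetical, chosen only for this illustration.

```python
import math


def sd_of_average(sd, n, mean_r):
    """S.D. of the average of n sets of marks, each with S.D. `sd`,
    whose average inter-correlation is `mean_r`."""
    return sd * math.sqrt((1 + (n - 1) * mean_r) / n)


def mean_r_from_sds(sd, sd_avg, n):
    """The same relation turned around: average inter-correlation
    implied by the component S.D. and the S.D. of the averages."""
    return (n * (sd_avg / sd) ** 2 - 1) / (n - 1)


if __name__ == "__main__":
    for mean_r in (0.70, 0.30):
        s = sd_of_average(10, 6, mean_r)
        # Assumption: the range spans about 3 S.D.s either side of the mean of 60.
        print(f"R = {mean_r:.2f}: S.D. = {s:.2f}, "
              f"range about {60 - 3 * s:.0f} to {60 + 3 * s:.0f}")
```

With six examinations of S.D. 10, an average inter-correlation of 0.70 gives a S.D. close to 8.66, and the lower the inter-correlation, the narrower the spread of the averaged marks.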

Multiple Correlation.—If two tests or examinations each correlate 
moderately with a third, then the two combined will usually correlate 
higher with the third than either separately. For example, English
and arithmetic examinations at 11+ usually yield correlations of
about +0.40 with subsequent achievement in grammar schools. The
combined mark is likely to correlate about +0.45 with such achievement.
Every educationist has realized and applied this principle.
With the aid of statistical methods we can formulate it more precisely. 

¹ Alternatively, $\sigma_{sum}/n$ may be substituted for $\sigma_{av}$, where $\sigma_{sum}$ is the S.D. of the
total scores. The same formula, turned around, is often useful for finding the
average inter-correlation of n variables when their separate and combined S.D.s
are known. Cf. Kelley (1924), Formula No. 171.

Let there be n tests which give correlations $r_{1a}, r_{2a}, r_{3a}, \ldots$ etc.
with some measure of achievement, A, the sum of all these coefficients
being $\sum r_{xa}$; and let the correlations of the tests with one another
be $r_{12}, r_{13}, r_{23}, \ldots$ etc., their sum being $\sum r_{xy}$. Then the correlation
of A with the combined tests is:

$$r_{a(1+2+\cdots+n)} = \frac{\sum r_{xa}}{\sqrt{n + 2\sum r_{xy}}}$$



Ex. 54.—Three different types of intelligence tests gave correlations
of +0.483, +0.456, and +0.337 with students’ marks in
psychology. $\sum r_{xa} = 1.276$. Their inter-correlations were +0.507,
+0.468, and +0.508. $\sum r_{xy} = 1.483$. What will be the correlation
of the combined tests with psychology marks?

$$r_{a(1+2+3)} = \frac{1.276}{\sqrt{3 + 2 \times 1.483}} = +0.522$$
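
The arithmetic of Ex. 54 may be sketched in a few lines. The function below assumes, as the formula itself does, that the tests are added with equal weight in standard-score form; it divides the sum of the test-criterion correlations by the square root of n plus twice the sum of the inter-correlations. The function name is hypothetical.

```python
import math


def r_sum_with_criterion(r_with_a, r_between):
    """Correlation of an equally weighted sum of n standardized tests
    with a criterion A.

    r_with_a  : correlations of each test with A (length n)
    r_between : the n(n-1)/2 inter-correlations among the tests
    """
    n = len(r_with_a)
    if len(r_between) != n * (n - 1) // 2:
        raise ValueError("expected n(n-1)/2 inter-correlations")
    return sum(r_with_a) / math.sqrt(n + 2 * sum(r_between))


if __name__ == "__main__":
    r = r_sum_with_criterion([0.483, 0.456, 0.337], [0.507, 0.468, 0.508])
    print(f"correlation of combined tests with psychology marks: {r:+.3f}")
```

The combined figure comes out a little above the best single coefficient of +0.483, which is the gain the text describes.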


The above method is a simple but crude way of combining tests 
so as to give better predictions of achievement. Naturally some of 
the tests are superior to others, and so should receive more weight 
in the total score. Further, the most useful tests will be the ones which, 
as well as correlating highly with A, do not correlate highly with the 
other tests that are being employed. For obviously, if tests 1 and 2 
are measuring almost precisely the same thing, they will predict, as
it were, the same aspect of A. Whereas what we want are tests which
predict different aspects, and so between them cover the whole of A
more thoroughly. Thus it may be better to leave out test 2, and 
substitute another which, while perhaps correlating less well with A, 
has the advantage of being almost independent of test 1. The method 
known as multiple correlation enables the educationist or psycholo- 
gist to choose from a number of tests those which will in combination 
give the highest possible correlation with A, and to weight them 
suitably. Multiple regression equations can also be calculated which 
predict an individual's most probable score on A by combining his 
marks on the separate tests in appropriate proportions. The method 
is too elaborate to be given here, but the above discussion will, it is 
hoped, indicate its object and main principles. 
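
Although the general method is not given here, the case of two tests is simple enough to sketch, and it shows why a nearly independent test can be worth more than a better but redundant one. The code assumes the standard two-predictor formulas for standardized regression weights; the function name and the correlations used in the demonstration are hypothetical and are not taken from any investigation cited in the text.

```python
import math


def multiple_r_two_tests(r1a, r2a, r12):
    """Multiple correlation of criterion A with the best-weighted pair
    of tests, plus the two standardized regression weights."""
    if abs(r12) >= 1:
        raise ValueError("tests 1 and 2 must not be perfectly correlated")
    beta1 = (r1a - r2a * r12) / (1 - r12 ** 2)
    beta2 = (r2a - r1a * r12) / (1 - r12 ** 2)
    r_squared = beta1 * r1a + beta2 * r2a
    return math.sqrt(r_squared), beta1, beta2


if __name__ == "__main__":
    # Hypothetical case: test 2 is as good as test 1 but almost duplicates it.
    redundant, _, _ = multiple_r_two_tests(0.50, 0.50, 0.90)
    # Hypothetical substitute: a weaker test that is nearly independent of test 1.
    independent, b1, b2 = multiple_r_two_tests(0.50, 0.40, 0.10)
    print(f"tests 1 and 2:          R = {redundant:.3f}")
    print(f"tests 1 and substitute: R = {independent:.3f} "
          f"(weights {b1:.2f} and {b2:.2f})")
```

In this illustration the substitute test correlates less well with A, yet the pair containing it predicts A better, because the two tests cover different aspects of the criterion.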

The vocational psychologist applies an analogous procedure in 
selecting the best candidates for a certain occupation, that is, he 
gives a series of tests each of which has been found to correlate 
moderately with success at the occupation, and combines them by 
the multiple correlation method so as to obtain as accurate predictions
as possible.


TEST RELIABILITY


If a test or examination is applied a second time under similar 
conditions, and the testees' scores differ widely from those previously 
obtained, the test is obviously a poor one. It is said to be reliable 
only if the two sets of scores correlate highly with one another. 
Further, if different testers apply and score a test or examination, 
they should arrive at the same, or nearly the same, scores. Repetition 
of a test may, however, give an unfair picture of its reliability, since 
the testees may remember their previous responses. Many tests are 
therefore supplied in two or more parallel forms, so that if a re-test 
is desired, different questions may be set which should, nevertheless, 
yield much the same results as did the questions of the first test. 
When no alternative form is available, a single test is often split into 
two equivalent halves. For instance, the scores on odd- and on even- 
numbered questions or items may be totalled separately, and then 
inter-correlated. But it is known that test reliability depends on the 
length of the test, hence the correlation between one half and the 
other half is unduly low, and is usually corrected by the formula:

$$R = \frac{2r}{1 + r}$$

Here r is the obtained coefficient, and R the coefficient
to be expected had it been possible to compare the whole of the test
with another similar test.
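
The split-half procedure can be sketched as follows, assuming each testee's item scores are held in a list, with odd-numbered and even-numbered items totalled separately. The function names and the small set of item scores are hypothetical and serve only to show the steps; the correction applied is the doubling formula R = 2r / (1 + r).

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def step_up(r_half):
    """Correct a half-test correlation to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Return (correlation between halves, corrected reliability).

    item_scores: one list of item marks per testee, items in test order.
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, step_up(r_half)


if __name__ == "__main__":
    # Hypothetical right/wrong marks of six testees on a six-item test.
    scores = [
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
    ]
    r_half, corrected = split_half_reliability(scores)
    print(f"half-test r = {r_half:.2f}, corrected R = {corrected:.2f}")
```

A group of six is far too small for a trustworthy coefficient; the figures merely show that the corrected value always exceeds the obtained one when r is positive and below 1.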

In all these instances—namely repetition, application and scoring 
by a different tester, parallel form, and corrected split-half—we 
generally expect to get an inter-correlation of at least +0.90. For if
this reliability coefficient is much lower it would indicate that the 
scores are too unstable to be trusted. Standardized educational and 
intelligence tests generally reach this level of reliability, but ordinary 
examinations often fail to do so, as we shall see in Chap. XI. Many 
performance tests, tests of occupational abilities, and character or 
temperament tests, are also often poor in reliability. However, by 
increasing the length of a test, or by combining several parallel tests, 
a more satisfactory coefficient may be attained. 

Knowing the reliability coefficient, it is possible to calculate the
reliability, or the limits of variation, of individual scores, by the
formula already given: $\mathrm{P.E.}_{\mathrm{est.}\,X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2}$. For example,
if parallel intelligence tests give a correlation of +0.93, and
the S.D. of their I.Q.s is 15, then $\mathrm{P.E.}_{\mathrm{I.Q.}} = 0.6745 \times 15\sqrt{1 - 0.93^2} = 3.71$.
But if the reliability coefficient is only +0.70, $\mathrm{P.E.}_{\mathrm{I.Q.}} = 7.21$,
that is almost twice as large. Such a P.E. is, as we have already
seen, an index of the extent to which scores on the second test may
vary when predicted from scores on the first test. About half the
testees will obtain I.Q.s on the second test within 4 points (if
r = 0.93), or about 7 points (if r = 0.70), of their I.Q.s on the first
test. But rare cases (1 in 100) may alter by as much as 4 × P.E., i.e.
by some 15 and 29 or more points respectively.

The application of two parallel forms of a test to the same indi- 
vidual yields, as we have seen, somewhat different results. If we 
possessed a large number of additional parallel forms and applied 
them, still other results would be obtained (quite apart from effects 
of practice). Such results for one individual would tend to conform to 
a normal distribution, and their grand average would, of course, be 
the best possible measure of the individual's standing on the test— 
what may be called his true score. The situation is exactly analogous 
to that which we discussed in connection with group differences 
(Chap. V), except that there we were concerned with variations 
among the means of sample groups, here with variations among an 
individual's scores on sample tests. An individual's score therefore 
possesses another P.E., which indicates its liability to deviate from 
his true score. The formula for this also involves the reliability of
the test, and the S.D. of the test scores of an unselected group of
persons. $\mathrm{P.E.} = 0.6745\,\sigma\sqrt{1 - r}$. Thus the P.E.s of I.Q.s for tests
with reliability coefficients of +0.93 and +0.70 are 2.67 and 5.53
respectively. Note that these P.E.s are smaller than those quoted
above, since $\sqrt{1 - r}$ is necessarily smaller than $\sqrt{1 - r^2}$. This is to
be expected, since the former P.E. represents the deviation of one
sample score from another sample, whereas the latter represents the
deviation from the mean of all the possible samples.
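
The two probable errors discussed above can be compared side by side. The sketch below assumes the formulas 0.6745 × S.D. × √(1 − r²) for predicting one score from another and 0.6745 × S.D. × √(1 − r) for the deviation of a score from the true score; the function names are hypothetical. Small differences in the second decimal place from the figures quoted in the text arise only from rounding in intermediate steps.

```python
import math

PE_FACTOR = 0.6745  # the probable error is 0.6745 of a standard error


def pe_of_estimate(sd, r):
    """P.E. of a score on one form predicted from a parallel form."""
    return PE_FACTOR * sd * math.sqrt(1 - r ** 2)


def pe_of_measurement(sd, r):
    """P.E. of an obtained score about the individual's true score."""
    return PE_FACTOR * sd * math.sqrt(1 - r)


if __name__ == "__main__":
    for r in (0.93, 0.70):
        print(f"r = {r:.2f}: P.E. of estimate = {pe_of_estimate(15, r):.2f}, "
              f"P.E. about true score = {pe_of_measurement(15, r):.2f}")
```

For any reliability between 0 and 1 the second figure is the smaller, in agreement with the remark that a score lies nearer to the true score than to another sample score.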

Factors Influencing Test Reliability.—Since the reliability of a test 
is almost synonymous with its thoroughness, it can be increased or 
lowered almost indefinitely by lengthening or shortening the test. 
The formula for predicting the reliability of a test whose length is 
doubled has already been cited. When it is lengthened n times, the
formula becomes: $R = \frac{nr}{1 + (n - 1)r}$. This is known as the Spearman-Brown
prophecy formula.

Ex. 55.—A short educational test has a reliability coefficient of
+0.60. How much must it be lengthened to attain a coefficient of
+0.90? Turning the above formula round, we get:

$$n = \frac{R(1 - r)}{r(1 - R)} = \frac{0.90 \times 0.40}{0.60 \times 0.10} = 6$$

It should be six times as long.

The same calculations would apply to the following example. 
If two examiners mark a set of examination scripts, and their marks 
inter-correlate +0.60, how many examiners should mark the scripts
for their combined marking to attain a reasonable reliability of
+0.90? The answer is six.


The formula assumes that the correlations between the additional 
tests (or examiners) would be the same as between the first pair of 
tests (or examiners). This condition seems generally to be satisfied 
by educational measurements. 
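
Both directions of the prophecy formula can be put in a brief sketch. It assumes, as stated above, that added tests or examiners correlate with one another as the first pair did; the function names are hypothetical.

```python
def spearman_brown(r, n):
    """Reliability expected when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


def lengthening_needed(r, target):
    """Factor by which a test of reliability r must be lengthened
    (or number of examiners needed) to reach the target reliability."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("r and target must lie strictly between 0 and 1")
    return target * (1 - r) / (r * (1 - target))


if __name__ == "__main__":
    n = lengthening_needed(0.60, 0.90)
    print(f"lengthen the test {n:.1f} times")
    print(f"check: reliability at that length = {spearman_brown(0.60, n):.2f}")
```

Setting n to 2 in the first function reproduces the split-half correction, so the doubling formula is simply a special case of the general one.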

A second important factor is the heterogeneity of the group whose 
scores provide the reliability coefficient. Parallel forms of an in- 
telligence test will inter-correlate much more highly if applied to 
school-children of several years’ age range than when applied to a 
single class or to a single year-group (i.e. children whose ages range 
over only one year). Coefficients obtained from the latter groups are 
conventionally accepted as fairer indices of reliability than coefficients
from the former. If we know $\sigma$, the S.D. of test scores in a
group which is unduly homogeneous or heterogeneous, then it is
possible to predict R, the reliability to be expected in a group with
S.D. $\Sigma$, by Kelley’s formula: $R = 1 - \frac{\sigma^2}{\Sigma^2}(1 - r)$.


Ex. 56.—The reliability coefficient of an intelligence test when
applied to a class of primary school-children was +0.88. The S.D.
of their I.Q.s was 14. What would be its reliability in a secondary
school class whose I.Q.s have a S.D. of only 9?

$$R = 1 - \frac{14^2}{9^2}(1 - 0.88) = +0.71$$

Note that the P.E. of an individual’s score is not upset by this
factor, as is the reliability coefficient. In the above example it works
out at ±3.27 in either class.
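
Ex. 56 and the note on the P.E. can be verified together. The sketch assumes Kelley's relation in the form R = 1 − (s² / S²)(1 − r), where s is the S.D. of the group in which r was obtained and S the S.D. of the other group, and takes the P.E. of an individual's score as 0.6745 × S.D. × √(1 − reliability). The function names are hypothetical.

```python
import math


def reliability_in_other_group(r, sd_known, sd_other):
    """Reliability expected in a group of S.D. `sd_other`, given the
    reliability r obtained in a group of S.D. `sd_known`."""
    return 1 - (sd_known ** 2 / sd_other ** 2) * (1 - r)


def pe_of_score(sd, reliability):
    """P.E. of an individual's score about his true score."""
    return 0.6745 * sd * math.sqrt(1 - reliability)


if __name__ == "__main__":
    r_primary, sd_primary, sd_secondary = 0.88, 14, 9
    r_secondary = reliability_in_other_group(r_primary, sd_primary, sd_secondary)
    print(f"reliability in the secondary class: {r_secondary:+.2f}")
    print(f"P.E. in primary class:   {pe_of_score(sd_primary, r_primary):.2f}")
    print(f"P.E. in secondary class: {pe_of_score(sd_secondary, r_secondary):.2f}")
```

The two probable errors agree because the product of the variance and (1 − reliability) is exactly what the formula holds constant from one group to the other.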

The assessment of test reliability and its implications are complex 
topics, and the account given here covers only the more elementary 
points. Advanced students are advised to consult Thorndike (1949) 
and Gulliksen (1950). 


Function Fluctuation.—Most reliability coefficients fall below 1.0
for two distinct reasons, first, through the test being insufficiently
thorough (i.e. liable to errors of sampling), and, secondly, because the
persons tested or examined may alter in their level of achievement
between one test and another. The latter factor has been termed 
‘individual variance’, or, ‘function fluctuation’ (Thouless, 1936). 
Naturally it has a greater effect on coefficients obtained from parallel 
tests given on different occasions, or from repeated tests, than it 
does on coefficients obtained by the split-half method. Hence the 
difference between these two types of coefficient affords a means of 
distinguishing between the reliability of the test itself, and the ten- 
dency of testees to fluctuate in respect of the ability tested. Thouless 
provides a formula for estimating function fluctuation from this 
difference. 

Effects of Test Reliability on Test Inter-correlations.—A test which 
is not perfectly reliable may be regarded as measuring so much of the 
function which a perfect test would measure, and so much error. 
The error part of it is quite useless, and will have the effect of re- 
ducing the correlation between this test and any other test. If 
both tests possess considerable errors through lack of reliability, 
their inter-correlation will be considerably reduced. In fact, the 
maximum possible value of such an inter-correlation will be, not 1.0,
but the square root of the product of their reliability coefficients.
Suppose we find a correlation of +0.50 between a selection examination
and subsequent achievement, and that the reliabilities of
the examinations and the measure of achievement are only +0.70
and +0.80. Then the true correlation, if it were possible to obtain
perfectly reliable exams and measures of achievement, would be

$$\frac{+0.50}{\sqrt{0.70 \times 0.80}} = +0.67$$

Spearman has called this reduction effect of imperfect reliability
‘attenuation,’ and has provided formulae by which it may be corrected
or compensated. Such correction is, of course, of theoretical
interest, since it tells us what correlations to expect between various
psychological abilities if they could be accurately measured. But it is
of little practical value, since completely accurate tests are unattainable,
and for all purposes of prediction we have to employ the
obtained, uncorrected, correlation coefficients (cf. Thouless, 1939a).
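
The ceiling set by unreliability, and the corresponding correction, can be sketched briefly. The code assumes the simplest form of the correction, in which the obtained correlation is divided by the square root of the product of the two reliability coefficients; the function names are hypothetical.

```python
import math


def max_observable_r(rel_x, rel_y):
    """Largest correlation obtainable between two measures with the
    given reliability coefficients."""
    return math.sqrt(rel_x * rel_y)


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Correlation to be expected if both measures were perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    print(f"ceiling on the obtained r: {max_observable_r(0.70, 0.80):.2f}")
    print(f"corrected correlation: {correct_for_attenuation(0.50, 0.70, 0.80):+.2f}")
```

The corrected figure is of theoretical interest only; any actual forecast must still rest on the obtained coefficient of +0.50.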


CHAPTER VIII 


ANALYSIS OF ABILITIES 


Fallacious Views of Mental Organization.—Correlational investiga- 
tions have assumed especial importance in recent years, since they 
enable us to clarify our conceptions of human abilities, and of the 
organization or structure of the mind. The layman's notions as to 
what abilities exist, and what each ability includes or excludes, are 
extremely loose, and the same was true of psychological theory 
throughout most of last century. Teachers and parents are heard to
remark: ‘Johnny is poor at book-learning, but he makes up for it
by his cleverness with his hands. He will be a good mechanic when
he grows up.’ ‘Mary has an excellent memory and good power of
attention.’ ‘Willie is slow but sure,’ and so on. Such statements are
now known, as a result of experimental investigation, to be largely
fallacious. Only exact study can tell us how far the slow person is 
likely to be sure, whether manual ability in a boy has any predictive 
value for mechanical ability in a man, whether there is any such 
entity in the mind as memory. Again, the so-called faculty school of 
psychologists of a hundred years ago considered the mind to be 
made up of a number of distinct powers or faculties such as reason- 
ing, judgment, imagination, etc. Indeed, it was at one time regarded 
as the main object of education to train each of these faculties by 
providing children with appropriate ‘mental gymnastics.’ As soon, 
however, as scientific psychologists tried to define the faculties pre- 
cisely and measure them, and to determine the effects upon them of 
various types of training, they fell into discredit. A priori analysis has 
not even been able to decide what is the true nature of intelligence. 
The accounts of it given by different writers have been extremely 
varied and discrepant. Nowadays, therefore, such problems, both of 
theory and practice, are approached by means of experimental 
studies of correlations between tests. For they all eventually resolve 
into the question—what correlates with what? 

The Scientific Definition of an Ability.—First we must define what 
we mean by an ability, capacity, or faculty. It implies the existence 
of a group or category of performances which correlate highly with
one another, and which are relatively distinct from (i.e. give low
correlations with) other performances. Take, for example, mechan- 
ical ability. Some people are better than others at tasks involving 
manipulation of mechanisms, and this ability is fairly consistent or 
general in the sense that those who are good at one such task are also 
usually good at others. The consistency is not perfect. A may be 
especially good with meccano, less clever with locks, B the opposite, 
C the best with electrical fittings, and so on. But if there was no 
appreciable correlation between these and other similar perform- 
ances, if they were all found to be specific, we should not be entitled 
to regard mechanical ability as a real entity. Another important 
condition is that the performances should be fairly reliable. We do 
not expect perfect stability, but if people fluctuated wildly in their 
success at mechanical tasks from day to day, we should hardly 
recognize the existence of mechanical ability. There is a further possi- 
bility, namely, that the various performances may inter-correlate 
well, but that the correlations may be accounted for by some other 
common factor such as general intelligence (the age factor, we will 
assume, has already been eliminated). Mechanical ability would not 
then be anything distinctive. However, intelligence tests can be 
applied to the same persons who take the mechanical tests, and 
intelligence can be partialed out or held constant. It is then found 
that the mechanical tests still overlap, or show positive correlations 
with one another over and above their correlations due to the intelli- 
gence factor. We are, therefore, able to accept mechanical ability as 
something consistent and distinctive. Tests of it do correlate suffi- 
ciently well and reliably for us to postulate some common element 
running through them. 
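
Holding intelligence constant is a matter of partial correlation, which can be sketched as follows. The code assumes the standard first-order formula, in which the product of the two correlations with the third variable is subtracted from the obtained correlation and the result divided by √(1 − r₁₃²) √(1 − r₂₃²). The function name and the coefficients for the two mechanical tests are hypothetical, chosen only to show the kind of result described above.

```python
import math


def partial_r(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


if __name__ == "__main__":
    # Hypothetical figures: two mechanical tests correlate +0.50 with each
    # other, and each correlates +0.40 with an intelligence test.
    residual = partial_r(0.50, 0.40, 0.40)
    print(f"mechanical tests, intelligence held constant: {residual:+.2f}")
```

A residual coefficient that remains clearly positive is what entitles us to speak of a group factor over and above general intelligence; had it fallen to about zero, intelligence alone would account for the overlap.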

A common element such as this is often referred to as a group 
factor, since it occurs in a group of performances of a certain 
restricted type. It differs from a general factor like intelligence, 
which is found to run through an extremely wide range of tests, which 
indeed enters to some extent into all abilities. 

Since the consistency or overlapping of different mechanical tests 
is not perfect, no one test can give a really adequate measurement of 
mechanical ability. But by combining several tests we are likely to 
get a result much more representative of the group factor as a whole.
The same applies, of course, in the measurement of educational
abilities. We do not expect long-division sums alone to tell us how
good pupils are at arithmetic, although they provide some indication
of the ability. Instead we set several types of sums, and try to cover
the whole field of arithmetic which the pupils are supposed to know. 

Arithmetic is likely to yield a group factor similar to the mechani- 
cal ability factor. But many of the faculties assumed by psychologists 
in the past, or by laymen at the present time, fail to stand up to the 
criteria enumerated above. Memory, for example, refers to so many
different things, that it is meaningless to talk of so-and-so as having a
good or bad memory in general, or of ‘training children's memories.’
The speeds with which people learn various sorts of material, and
their retentiveness (i.e. the amounts of such material which they can
recall after an interval), give only moderate or low inter-correlations,
especially when the influence of intelligence is removed. Probably
there exist several small group factors, each representing memory
for a certain narrow range of material.¹ In other words, there may be
several different types, but no single entity, of memory. 

The notion that mental speed differs from, and is usually opposed 
to, mental accuracy or power also appears to be fallacious. If there 
were two quite distinct abilities, then tests of speed should give high 
correlations with one another, and low or negative correlations with 
tests of accuracy or power. Under appropriate conditions of testing, 
difficult untimed tests and easy speeded tests do yield somewhat 
different results; thus a partial distinction is indicated. But they still 
correlate positively to a moderate or a high degree, showing that, on 
the whole, those who are ‘slow’ at mental work are more likely to be 
‘unsure’ than ‘sure.’ 

All Mental Abilities are Positively Inter-correlated.—Let us now turn 
to the wider problem—what abilities does the mind contain, and how 
are they organized or inter-related? A first, extremely important, fact 
is that all tests of mental abilities tend to give positive inter-correla- 
tions. Even tests of manual, physical, and other non-intellectual 
functions also usually correlate positively with one another and with 
mental tests, though the coefficients are often very small, whereas 
the coefficients among tests of intellectual functions are generally 
moderate to high. In selected groups, such as university students, 
occasional negative correlations may arise; but these are not typical,
and they are seldom statistically significant. In other words, talent and
versatility, more often than not, go together; a person outstandingly
good or bad in one field is usually above or below average, respectively,
in all other fields. This fact at once serves to cast a great
deal of doubt on the ‘compensation theory,’ i.e. the view that those
who are poor at one thing make up for it by being good at something
else. It does not entirely disprove it, for the following reasons.

¹ A summary of the evidence on this and other types of ability, obtained
between 1940 and 1950, is given in the writer’s book, The Structure of Human
Abilities (1950).

First the reader should study the scattergram for a very low
positive correlation, such as that shown in Fig. 26. This compares
the marks of students in an examination in hygiene with their
intelligence-test scores, and yields an r of +0.146. It will be seen that,
though there is still a slight tendency for high scores on $X_1$ to be
associated with high scores on $X_2$, yet the correlation admits of many
big exceptions. Students who are among the highest 3% for intelligence
may range right down to the lowest 18% for hygiene. The
large amount of regression means that a fair proportion of persons
can be high on one variable, low on the other. Such a state of affairs
is certainly much more likely to occur here than it is when the correlation
is high (as in Fig. 23).

Fig. 26.—Scattergram and regression lines illustrating a low correlation, r = +0.146.

Now ‘book-learning’ and ‘being clever with the hands’ are not 
highly correlated, hence it is quite possible for some children to be 
much better at one than at the other. But it is definitely false to 
suppose that they are inversely related, that those who are bad at 
one are therefore good at the other. On the contrary, those who are 
superior in intellectual abilities are on the average superior in prac- 
tical ones also, though to a lesser extent. 

A second reason is that age differences often obscure the issue. 
Some teachers are incredulous when told that children of superior 
intelligence tend to be superior also in growth and physical capacities 
to dull children, since they know well that the dullards in an ordinary 
school class are usually larger and stronger than the bright young- 
sters. But, then, they are forgetting that the dullards are far older 
than the bright ones, and so may be physically superior. If the dull 
children were compared with bright ones of the same age, the 
physical superiority of the latter would at once be obvious. Never- 
theless, in this instance also the correlations are so low that a great 
many exceptions are possible. 

Thirdly, ability is obviously to some extent a matter of interest 
and practice. Pupils who are backward at intellectual studies and 
who are on the average backward, but much less so, in physical and 
manual pursuits, naturally devote more time and energy to such 
pursuits. Bright pupils, though initially stronger and better at prac- 
tical matters, may be more interested in intellectual matters, and so 
fall behind in other things. It is possible, therefore, for the differences 
that exist between abilities which give low inter-correlations to be- 
come accentuated, though there is still no evidence known to the 
writer that the correlations ever become negative. 

We may conclude, then, that all the abilities with which teachers 
are concerned are, to a greater or lesser extent, positively correlated. 
As an example, we will quote some of Burt’s (1917) results. He gave 
thirteen objective tests of achievement in various school subjects to 
120 children aged 11+, and calculated the inter-correlations. The
following table shows the average correlation between each test and
the other twelve tests.¹

Composition               +0.505
History                   +0.448
Geography                 +0.456
Nature Study              +0.458
Mechanical Arithmetic     +0.283
Arithmetic Problems       +0.457
Reading Fluency           +0.323
Reading Comprehension     +0.369
Writing Speed             +0.323
Writing Quality           +0.274
Dictation                 +0.325
Drawing                   +0.275
Handwork                  +0.320

¹ The average reliability of the tests was +0.738. All these coefficients would,
therefore, be larger by roughly 35% if the tests had been perfectly reliable (cf.
p. 129). A later repetition of the investigation with a much larger number of
children has yielded closely similar results (Burt, 1939a).

Composition, history, geography, nature study, and arithmetic
problems overlap with the other tests to the largest extent. Mechanical
arithmetic, drawing, and writing quality correlate to the smallest
extent. This shows that there is a tendency for a pupil's achievements
in all subjects to run on a fairly even level, though the correlations
are low enough for some pupils to exhibit considerable divergences:
special strengths in some subjects, special weaknesses in others.
During the secondary school stage, correlations between subjects
decrease, partly perhaps because abilities tend to differentiate during
adolescence and specialised talents develop, but mainly because the
pupils form more homogeneous groups (cf. pp. 122-3). Nevertheless,
the positive overlap of all intellectual abilities continues even beyond
the university level. For example, the present writer inter-correlated
the marks of 300 graduate students at a training college, and obtained
small positive coefficients between subjects as widely different as
teaching skill, arithmetic, hygiene, education, psychology, speech
training, singing, physical training, etc. (cf. Vernon, 1939).


These facts are of considerable importance to examiners. They
indicate that a General Certificate in several subjects, achieved at the
age of 15 to 18, is some criterion of aptitude for any intellectual
occupation, or for a university career, albeit not a very good one.
Similarly, the traditional method by which the Civil Service selected
people for a wide variety of posts was a general examination based
largely on subjects studied at school or university. This implied that
the possession of ‘brains’ for one subject indicates aptitude for all
other subjects. Though this method was a fairly sound one, the
Civil Service has now come to realize that the efficiency of its
selection can be improved by adding tests of the aptitudes needed
for each type of post.

The General Factor Responsible for Positive Overlap.—Now when
we find positive association, as in Burt's investigation, we are justified
in assuming the existence of a general factor—general educational
ability—which runs through all the separate subjects. Some sub- 
jects may be said to be more ‘saturated’ with it than others, since they 
are more highly correlated with it. In the primary school we should 
get an excellent indication of this general factor from pupils’ marks 
in composition and in arithmetic problems, a much poorer indication 
of it from drawing or handwriting quality. Statistical methods are 
available for assessing a pupil's standing on this general factor from
his marks on all the component subjects (cf. p. 124). Further, it is 
quite easy to partial out the factor, and so to see whether or not it 
accounts for all the correlations between the separate subjects, 
whether, in other words, it is the only common element involved. 
When this is done, it is found that there are still a number of corre- 
lations over and above those due to the general factor. In Burt's 
study, the residual correlations indicated considerable overlapping 
within the following four groups of tests: (1) arithmetic problems 
and mechanical arithmetic; (2) handwork, drawing, writing speed 
and quality; (3) dictation, reading comprehension and speed; 
(4) composition, history, geography, and nature study. But there 
was little or no overlap, or even negative correlations, between the 
tests in one of these groups and those in another group. Such results 
show clearly the existence of group factors, i.e. of distinctive types of 
educational abilities, over and above general ability. In this instance 
there is an arithmetical factor, a manual one, a linguistic one, and 
one including all the higher, more integrative, school subjects. As 
Burt points out, we have here a basis for cross-classification of 
pupils. He suggests that they should be assigned to school classes in 
accordance with their achievements in the linguistic and ‘integrative’ 
subjects, and that they might advantageously be reclassified for 
arithmetic lessons, and again for manual subjects. 
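
As a rough numerical illustration of 'partialling out' a general factor, the following Python sketch subtracts from each correlation the amount attributable to the general factor and prints what remains. The test names, saturations and group loadings are all invented for the purpose and are not Burt's figures. The residuals are taken as simple differences (observed correlation minus the product of the two g-saturations), which is an assumption about the convention used.

```python
import numpy as np

# Hypothetical general-factor saturations for four tests (invented, not Burt's figures).
tests = ["arith_problems", "mech_arith", "composition", "history"]
g = np.array([0.8, 0.6, 0.8, 0.7])

# Hypothetical group-factor loadings: column 0 arithmetical, column 1 'integrative'.
group = np.array([
    [0.5, 0.0],
    [0.6, 0.0],
    [0.0, 0.4],
    [0.0, 0.5],
])

# Correlations implied by general plus group factors (unit diagonal).
R = np.outer(g, g) + group @ group.T
np.fill_diagonal(R, 1.0)


def residual_correlations(R, loadings):
    """Correlations left over after removing what one factor accounts for."""
    resid = R - np.outer(loadings, loadings)
    np.fill_diagonal(resid, 0.0)  # the diagonal is not of interest here
    return resid


resid = residual_correlations(R, g)
print(tests)
print(np.round(resid, 2))
```

In this toy case residuals remain only within each pair of related tests, and vanish between tests drawn from different groups, which is the pattern the text describes as evidence of group factors.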

Analogous findings were obtained in the writer’s study of training 
college marks. Three separate ‘families’ of subjects could be dis- 
tinguished, a practical one (teaching skill, speech training, and 
physical training), a scientific one (psychology, hygiene, arithmetic), 
and a literary one (English, speech training, education, history). Few 
researches along these lines have been carried out in the industrial 
sphere, owing to the difficulties of measuring and correlating 
different vocational abilities. But we might expect to get from them
similar results, for example, group factors for the engineering type
of occupation, for the salesmanship type, the clerical type, and so on. 


FACTORIAL ANALYSIS 

The statistical investigations of Spearman (1927) and others have 
shown that it is possible to account for practically the whole of a 
set of test inter-correlations by postulating appropriate common 
factors.¹ It is legitimate, then, to regard each test as made up of two
components, or as measuring two things. In so far as it correlates 
with other tests, it is measuring some factor or factors. This is often 
referred to as the test's communality. But in so far as it fails to 
correlate (and most test inter-correlations, be it remembered, are 
only moderately high), it is measuring a purely specific component, 
or something which is peculiar to that test alone and has no relation- 
ship to any other test. This is called the test’s specificity. It includes, 
of course, the error which is present in the test on account of im-
perfect reliability² (cf. p. 129). Thus any test of an educational or
vocational ability can be analysed into certain proportions of certain 
factors. For example, in Burt's research, the arithmetic problems 
test can be analysed into so much general educational ability, so 
much arithmetical group factor (these together make up its com- 
munality), and, thirdly, a component specific to that particular test. 
The mechanical arithmetic test can be analysed into a rather smaller 
proportion of general factor, a large proportion of the arithmetic 
group factor, and a separate specific component. The presence of 
these specific components reduces the correlation between the two 
tests, but it is still quite high, namely + 0.76, because they have both
a general and a group factor in common. A test of another subject, 
such as handwork, still correlates positively with arithmetic prob- 
lems, because they are both partially saturated with the general 
factor, but the coefficient is a small one, namely + 0.38, since hand-
work embodies a different (manual) group factor, as well as a
different specific component.
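
The analysis of a test into communality and specificity may be set out compactly. The following is only a sketch in modern notation, and the symbols are assumptions rather than the author's: test j is taken to have a loading g_j on the general factor and a_j on its group factor, the factors are taken to be uncorrelated, and all scores are standardized.

```latex
\begin{align*}
h_j^2 &= g_j^2 + a_j^2        && \text{(communality of test } j\text{)} \\
u_j^2 &= 1 - h_j^2            && \text{(specificity, including error)} \\
r_{jk} &= g_j g_k + a_j a_k   && \text{(tests } j, k \text{ sharing a group factor)} \\
r_{jl} &= g_j g_l             && \text{(tests } j, l \text{ in different groups)}
\end{align*}
```

On this reading, the coefficient between the two arithmetic tests contains both products, whereas that between arithmetic problems and handwork contains only the general-factor product, which is why it is the smaller of the two.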

1 Note that we do not claim to be able to account for the whole of the correla- 
tions, for such correlations are, of course, always liable to errors of sampling. 
Often the residual coefficients, which are left after a general factor and some 
group factors have been extracted, are so small relative to their P.E.s, that it is 
not worth while analysing them further. Even with the most accurate techniques, 
such as Hotelling's and Burt's, the later factors are generally so small that little 
significance can be attached to them.

2 Thomson (1939) points out, however, that some of the influences producing 
unreliability, e.g. those of the function fluctuation type, may actually be com- 
mon to several tests, and so contribute to their communality. 
By means of factor analysis, then, it is possible to reduce a large
number of test results to a few underlying common factors. Any 
psychological test, or measure of a scholastic or occupational 
ability, can be regarded as made up of some of these factors, or as 
possessing a certain ‘factor pattern.’ Further, an individual person's 
standing on each factor can be derived from his test scores, and he, 
too, possesses a certain factor pattern. Thus, a pupil who is high in 
general educational factor will be above average in most of his school 
work, but his relative goodness at different subjects will depend on 
the strength or weakness of his group and specific factors. 
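
One conventional way of deriving a person's standing on a factor from his test scores is a regression estimate. The sketch below assumes a single general factor with known saturations and standardized test scores. The saturations and the pupil's scores are hypothetical, and the procedure is offered as an illustration of the principle, not as the particular method the text refers to.

```python
import numpy as np

# Hypothetical general-factor saturations of five school tests (invented values).
a = np.array([0.8, 0.7, 0.6, 0.5, 0.4])

# Correlation matrix implied by one common factor plus specifics.
R = np.outer(a, a)
np.fill_diagonal(R, 1.0)

# Regression weights for estimating the factor from standardized scores.
w = np.linalg.solve(R, a)

# One pupil's standardized scores on the five tests (also invented).
z = np.array([1.2, 0.9, 0.3, 1.0, -0.2])
g_hat = float(w @ z)

# Correlation between the estimate and the factor itself.
validity = float(np.sqrt(w @ a))

print(np.round(w, 3), round(g_hat, 2), round(validity, 3))
```

Under these assumptions the weights are proportional to a/(1 - a²), so that the more highly saturated tests count for more in the estimate.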

Dangers of Factor Analysis.—At this point a strong warning should 
be given against carrying the factorial conception of abilities too far. 
Factors are not entities in the mind whose nature or constitution, 
and whose strength or weakness, are immutably fixed. Some writers 
do seem to assume that factorial analysis is revealing the funda- 
mental elements of which human minds are compounded, much as 
chemical analysis reveals the elements from which chemical sub- 
stances are compounded. We should realize that factors consist 
primarily of categories for classifying mental tests and examinations. 
Their value lies in the fact that they tell us objectively what correlates 
or overlaps with what, and so they take the place of unverified 
faculties and other subjective conceptions of human traits and 
abilities. 

It is obviously illegitimate to compare factors in the educational 
field with chemical elements. For these factors, which are deduced 
from the correlations between various school subjects, must depend 
largely on how the subjects are taught and on the stage which the 
pupils have reached. Different factor patterns will, therefore, be 
found in different school classes, according as the teachers connect 
up, or bring about overlap in their pupils’ minds between, the sub- 
jects. For example, Oldham (1937-8) found very diverse relation- 
ships between arithmetic, algebra, and geometry among different 
12- to 13-year-old school classes, which could be ascribed to differ- 
ences in teaching methods and in the pupils’ attitudes towards the 
subjects. Others have proved that the factors extracted from a set of 
mental tests alter progressively with practice at the tests. 

Like all other statistical constructs, factors will be liable to a good 
deal of variation if they are derived from measures of small numbers 
of persons. But apart from such sampling errors, they must be 
regarded as depending to a considerable extent on the particular 
group of persons measured. Thomson (1939) shows that they are
markedly affected by the heterogeneity or homogeneity of this 
group of persons. In addition, they are dependent upon the particular 
set of tests used, since the factor pattern of a test is in essence an 
expression of the relations between this test and all the other tests in 
the set. Statistical analysis of a single test in isolation is incapable of 
telling us what factors it contains (though subjective analysis, or 
inspection of its content, may often give useful hints as to its prob- 
able factor pattern, which can later be verified by studying ob- 
jectively its correlations with other tests). However, this limitation 
may not be very serious, since, as we shall see below, at least some of 
the most important factors emerge in almost identical form from a 
wide variety of sets of tests. 
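
The dependence of correlations on the heterogeneity of the group can be illustrated by simulation. The sketch below is a toy under stated assumptions: two invented tests share one common factor, and a more homogeneous group is formed by keeping only those persons who fall above the median on the first test. None of the figures comes from Thomson.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Two standardized tests sharing one common factor (loadings invented).
g = rng.standard_normal(n)
x = 0.8 * g + 0.6 * rng.standard_normal(n)
y = 0.7 * g + np.sqrt(1 - 0.49) * rng.standard_normal(n)

# Correlation in the full, heterogeneous group.
r_all = np.corrcoef(x, y)[0, 1]

# A more homogeneous group: only those above the median on the first test.
selected = x > np.median(x)
r_selected = np.corrcoef(x[selected], y[selected])[0, 1]

print(round(r_all, 2), round(r_selected, 2))
```

The correlation in the selected group comes out appreciably lower than in the whole group, although the tests themselves are unchanged. Any factors extracted from the two sets of coefficients would differ accordingly.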

Then there is the more subtle difficulty that the same set of tests 
can always be factorized in a great many different ways. Most of 
these ways will lead to quite illogical results, so that the investigator 
has to choose from among the various possible factorizations the 
most meaningful and convenient pattern. Take, for example, the 
educational tests applied by Burt. We have so far regarded these as 
analysable into general educational ability, four group factors for 
related subjects, and separate specific factors for each test. Now Burt 
showed, incidentally, that the general factor was not identical with 
general intelligence as estimated by teachers, or as measured by 
intelligence tests, though the two overlapped closely. But it would 
have been quite legitimate to analyse measures of intelligence along 
with the educational measures, and to partial out intelligence first, 
so making it the principal factor underlying achievement. If this 
had been done, it is likely that a subsidiary general factor would 
have emerged representing the pupils’ application to, and interest 
in, their work in general. Intelligence plus this subsidiary factor 
would then make up what we have previously denoted as the 
general educational ability factor. The group factors would also need 
to be somewhat modified, since most of the fourth one (composition, 
history, etc.) might have been absorbed into, or accounted for by, 
the intelligence factor.¹

1 In his later study (1939a), Burt describes several methods of factorizing
which do yield numerically different, though logically consistent, results.

In investigating students' college marks, the writer obtained
analogous results. The inter-correlations could be accounted for
either by a very prominent general factor and group factors
restricted to a few subjects, or else by three factors, all of about equal
prominence, each entering into several subjects.
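
That the same correlations admit of more than one factorization can be checked numerically. In the sketch below the loadings are invented. An orthogonal rotation, here through an arbitrary 35 degrees, yields a different pattern of loadings which nevertheless reproduces exactly the same inter-correlations.

```python
import numpy as np

# Hypothetical loadings of four tests on two uncorrelated factors (invented).
A = np.array([
    [0.7, 0.3],
    [0.6, 0.4],
    [0.5, -0.3],
    [0.6, -0.4],
])


def rotate(loadings, degrees):
    """Apply an orthogonal rotation to a two-factor loading matrix."""
    t = np.radians(degrees)
    T = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t), np.cos(t)]])
    return loadings @ T


B = rotate(A, 35.0)

# Correlations (and communalities on the diagonal) reproduced by each solution.
R_from_A = A @ A.T
R_from_B = B @ B.T

print(np.round(B, 2))
print(np.allclose(R_from_A, R_from_B))
```

Since nothing in the correlations themselves distinguishes the two solutions, the choice between them has to rest on which is the more meaningful and convenient.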


Certain other precautions which should be observed in inter- 
preting the results of factorial analyses will be mentioned later. 

Types of Factor Pattern.—Three main types of factor pattern 
recur very frequently in educational and psychological investigations. 
They may be represented by the following diagrams, each of which
refers to twelve hypothetical tests, Nos. 1, 2, ...


I. BI-FACTOR PATTERN

Test   General   Group Factors   Specific
       Factor    (A, B, C, D)    Factors
  1       x            x             x
  2       x            x             x
  3       x            x             x
  4       x            x             x
  5       x            x             x
  6       x            x             x
  7       x            x             x
  8       x            x             x
  9       x            x             x
 10       x            x             x
 11       x            x             x
 12       x            x             x


II. MULTIPLE-FACTOR PATTERN

Test   Common Factors   Specific
       (A, B, C, D)     Factors
  1        x x              x
  2        x                x
  3        x x              x
  4        x                x
  5        x x x            x
  6        x                x
  7        x x              x
  8        x x              x
  9        x                x
 10        x                x
 11        x x              x
 12        x                x


III. UNI-FACTOR PATTERN

Test   General   Specific
       Factor    Factors
  1       x         x
  2       x         x
  3       x         x
  4       x         x
  5       x         x
  6       x         x
  7       x         x
  8       x         x
  9       x         x
 10       x         x
 11       x         x
 12       x         x


In the first, which Holzinger and Harman (1941) call the bi-
factor pattern, there is a general factor running through all the tests,
four different group factors, and specifics. This pattern (or similar 
more complex ones) is usually adequate for representing the results 
of tests of abilities. For instance it applies excellently to Burt's 
investigation. 

The second, multiple-factor, pattern requires three, four, or more 
factors, none of them general, but all running through several tests, 
to account for the test inter-correlations. It will be seen that Test 1 
involves factors A and C and a specific. But as far as possible each
test is covered by a single factor (with only negligible loadings on the
others). This type of pattern is referred to as Simple Structure (cf.
Thomson, 1939; Thurstone, 1947). It is often used in analysing tests 
of abilities, but is still more appropriate where tests of personality, 
character, or temperament are concerned, since in these fields a 
general trait, running through all the tests, would seldom be appro- 
priate. For example, Thurstone (1931) obtained measures of the 
interests of a group of students in a large number of different occu- 
pations—advertising, art, chemistry, etc.—and found that the interest 
scores could be classified under four main factors. These he roughly 
identified as interest in language, in science, in business, and in 
people. Each separate interest could be made up of appropriate 
proportions of these four factors. Other instances have been
surveyed by Eysenck (1953).


Spearman's Two-factor Theory.—The third, unitary-factor, pattern 
has aroused a great deal of controversy. Historically it was the first
type to be discovered. Early in the present century, Spearman found 
that diverse tests of mental abilities usually gave inter-correlations 
which could be wholly accounted for (within the limits of their 
errors of sampling) by a single general factor plus specific factors. 
He enunciated certain criteria to which the correlations must con- 
form if this is to be possible, the most important of which is called 
the tetrad difference criterion.¹ The general factor obviously
corresponds to what we commonly mean by general intelligence, but
Spearman preferred to symbolize it by the letter g, the specific factors
by letters s₁, s₂, . . ., etc., so as to get over the difficulty, mentioned
above, that the term ‘intelligence’ has been so diversely defined by 
different psychologists. His g factor, it was claimed, would be the 
same whatever the tests used. It could be measured, for example, by 
a battery of tests consisting of Analogies, Completion, Vocabulary, 
and Directions, or equally well by a battery consisting of practical per- 
formance tests (pp. 160-2). Each test would have its own independent 
s, but in so far as it overlapped with other tests, it would be measur-
ing g. This view is widely known as the Two-factor Theory.

For a while it was thought that the Two-factor Theory would 
apply to all tests of abilities, not only to intellectual ones, and that 
very few additional group factors would be needed. Only occasion- 
ally would a few tests which were very similar in content (e.g. tests 
of musical aptitude) be found to inter-correlate over and above their 
correlation due to g. 
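
The tetrad difference criterion can be illustrated as follows. For any four tests a, b, c, d, differences such as r_ab × r_cd − r_ac × r_bd should vanish, within the limits of sampling error, if a single general factor accounts for the correlations. In the sketch below the saturations are hypothetical, and the extra overlap of 0.15 between the first two tests is likewise an invented figure standing for a small group factor.

```python
import numpy as np
from itertools import combinations


def tetrad_differences(R):
    """All tetrad differences among every set of four distinct tests."""
    out = []
    for a, b, c, d in combinations(range(R.shape[0]), 4):
        out.append(R[a, b] * R[c, d] - R[a, c] * R[b, d])
        out.append(R[a, b] * R[c, d] - R[a, d] * R[b, c])
        out.append(R[a, c] * R[b, d] - R[a, d] * R[b, c])
    return np.array(out)


# Correlations generated by a single general factor (invented saturations).
g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
R_one = np.outer(g, g)
np.fill_diagonal(R_one, 1.0)

# The same table with extra overlap between the first two tests only.
R_group = R_one.copy()
R_group[0, 1] = R_group[1, 0] = R_one[0, 1] + 0.15

print(np.abs(tetrad_differences(R_one)).max())
print(np.abs(tetrad_differences(R_group)).max())
```

With the single-factor table every tetrad difference is zero apart from rounding, whereas the added overlap produces differences that are plainly not zero. With real data the differences would have to be judged against their sampling errors.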

1 For an explanation, see Knight (1933), or Spearman's own book (1927).
The apparent discrepancy between the names—Two-factor Theory and Unitary-
factor Pattern—is merely due to the fact that Spearman includes the specific
components as factors, whereas the terms we have used refer only to the com-
munality of the factorized tests.

Different tests may differ considerably in their g-saturations.
Those with the highest saturation seem to involve the capacity for
seeing relationships, whilst tests which involve mainly mechanical or
rote memory contain very little g. Among school subjects, classics is
most dependent on g; French, English, history, and the like are also
highly saturated. Manual subjects and music have much more
prominent s's and relatively little g. Thus we find high correlations
between classics and English, low correlations between handwork
and music, and moderate correlations between classics and hand-
work, or English and music. The Two-factor Theory not only
explains these and similar facts as to the relations between abilities,
but also provides the mental tester with a scientific basis for choosing
tests which will give the best measures of g. Although every sub-test
in a battery of intelligence tests will involve a certain s-factor, tests
can be selected in which these s's are small, and when the tests are
combined the s's will tend to cancel out, so leaving an almost pure
measure of g.
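
The cancelling-out of specific factors when tests are combined can be shown with a little arithmetic. The sketch assumes standardized tests whose specific parts are mutually independent, and an arbitrary g-saturation of 0.6 for every test. The function name is merely illustrative.

```python
import numpy as np


def composite_saturation(loadings):
    """g-saturation of an unweighted sum of standardized tests.

    Assumes each test = loading * g + an independent specific part,
    so that the specifics are uncorrelated from test to test.
    """
    a = np.asarray(loadings, dtype=float)
    common = a.sum() ** 2            # variance of the sum due to g
    specific = (1.0 - a ** 2).sum()  # variance of the sum due to the s's
    return float(np.sqrt(common / (common + specific)))


for n in (1, 2, 4, 8, 16):
    print(n, round(composite_saturation([0.6] * n), 3))
```

The saturation of the total rises steadily as tests are added, though under these assumptions it approaches unity without ever reaching it.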

Criticisms of the Two-factor Theory.—The theory has been sub- 
jected to much criticism, both on psychological and on statistical 
grounds. Spearman suggested that g might correspond to the con- 
ception of ‘general mental energy.’ Whether or not this is a psycho- 
logically sound explanation need not concern us here. On the 
statistical side, Thomson (1939) and others pointed out that, al- 
though the analysis into general and specific factors is legitimate 
when the tetrad difference criterion is satisfied, yet this is not the 
only possible mode of analysis; that if mental abilities consisted of 
large numbers of overlapping group factors, the same criterion 
would still be satisfied. Thomson’s view would seem to the present 
writer to be preferable from the psychological standpoint, but 
Spearman’s view is perhaps more practically convenient, since we do 
already in everyday life assume the existence of something like g— 
whether we call it ‘brains,’ ‘brightness,’ ‘intelligence,’ or what not—
and the Two-factor Theory provides a scientific basis for this
common-sense assumption.

The theory, and the experiments which it has stimulated, at one
time seemed to discredit all the mental faculties which had so often
been abused in traditional psychology and in common parlance. But
more recent research with modern factorial methods indicates that
the theory requires considerable modifications in the matter of
group factors. A large variety of group, or multiple, factors has been
described, particularly in American researches, and some of them do
resemble the older faculties. Unfortunately, there are considerable
disagreements between different factorists as to which are the most 
distinctive and fundamental, but the series that Thurstone established 
is very widely accepted. Thurstone (1938) applied an extensive bat- 
tery of psychological tests to university students and found that they 
resolved into factors which he identified as verbal ability (V),
number ability (N), word fluency (W), spatial judgment (S), rote
memory (M), perceptual speed (P), and deductive and inductive
reasoning (D, I). A fairly similar picture was obtained in studies of
children, even down to the 5- to 6-year level, and Thurstone has issued
sets of tests for measuring these factors at several age levels, namely
the P.M.A. or Primary Mental Abilities tests. Note that he does not 
allow for a g running through all these abilities, though Spearman 
and his followers have shown that a pattern of general + group 
factors fits the same results as well as, or better than, a pattern of 
independent multiple factors.¹

It may safely be concluded that no test measures nothing but g and 
a specific factor, since the type of test material employed always 
introduces some additional common element. All verbal intelligence 
tests, together with other tests depending on manipulations of words, 
involve a verbal and educational factor, which has been named V or 
v : ed. Those based on multiple-choice items, to be done at speed, 
embody a further factor or factors, not present in unspeeded, 
creative-response tests such as Terman-Merrill. In order to escape 
the influence of v: ed, many mental tests have been constructed from 
abstract diagrams or pictures. Such tests, however, in so far as they 
involve spatial imagination or the mental manipulation of shapes, 
yield a spatial factor called k (El Koussy, 1935), or S (Thurstone, 1938).
Certainly some, and possibly all, of the common non-verbal per-
formance tests bring in this same k, and various minor dexterity
factors in addition. Mechanical ability tests too measure k, together 
with a mechanical information or experience factor, m. School 
examinations and educational attainments tests depend (as Burt's 
research showed) not only on g and v : ed, but also on a factor which 
represents the pupils’ interest in and attitudes to work, and their 
character or temperament qualities. Following Alexander (1935), 
this is often referred to as the X-factor. In addition, there are probably
group factors for families of school subjects such as the linguistic, 
scientific and mathematical, practical and manual, and further sub- 
factors largely governed, as we have seen, by teaching methods. 

1 A more detailed discussion of the pros and cons of the g + group-factor and
the multiple-factor approaches is given elsewhere (Vernon, 1950), and it is shown
that they can fairly readily be reconciled.

2 Alexander (1935) claimed the existence of a distinctive practical factor, F,
in certain performance tests; but later work appears to indicate that it cannot be
differentiated from k.

Educational and Vocational Applications.—The findings of fac-
torial research have important educational and vocational bear-
ings. If the Two-factor Theory was exclusively accepted, educa-
tional and vocational guidance by means of tests would scarcely be
possible. If we wished to advise a child as to the type of education
or the occupation for which he would be most suited, we could
apply tests of g and so deduce the general intellectual level of the
work he should take up. But we could not decide whether he would
be likely to do better at classics or science, at a clerical job or an
engineering trade, and so on, except by giving a vast number of
tests of different s factors, or by letting him try each type of work in
turn so as to find by practical experience which specific one suited
him. There could be no recommending of general lines of occupation.
To a certain extent this does seem to be true. Educational and 
vocational guidance among adolescents are extremely difficult and 
uncertain. They have to be based more upon a subjective analysis of 
interests and character traits, background and relevant training, than 
upon objective predictions by tests. Yet common experience strongly 
suggests that there do exist general categories or types of work, and 
that therefore a number of group factors, additional to m, could be 
established by further research. If this were done, guidance would 
become much more scientific. Full consideration of a candidate's 
interests and character would, of course, still be required, but his 
scores on a series of tests for each of these factors would give 
objective indications of his probable ability along each of the main 
lines of work.¹ Much the same is true in clinical work with malad-
justed children or neurotic or psychotic adults. Psychiatrists and
clinical psychologists often claim, on the basis of poorly validated
tests, that their patients show deterioration or weakness in such
faculties as memory, orientation, attention, visualization, etc. Factor
analysis indicates that most of these tests are merely rather un-
reliable measures of g, or g + v : ed. But at the same time it can be
used to isolate definite group factors such as mental speed, and
flexibility in the formation of concepts, which are especially apt to
be affected by abnormal mental states; and it can show which tests
are the most promising for measuring such factors.

1 Thomson (1939) has pointed out that when tests are given with the object
of predicting educational or vocational achievement, the extraction of factors
from the tests may constitute quite an unnecessary intermediate step. A much
more efficient method is to correlate the tests directly with measures of achieve-
ment and then to use multiple correlation for obtaining the most accurate pre-
dictions. While this view is, of course, perfectly true from the statistical stand-
point, it would seem to the present writer to be practically applicable only in
educational and vocational selection, not in guidance. It should certainly be
used by the tester who wishes to select the best candidates for some single and
definite educational course or job. But the problems of the vocational adviser
are much too complex to be tackled as yet by this exact method. He is forced to
predict a candidate's probable achievement in generalized terms, or factors. At
the present time he is doing so largely in terms of unverified factors, derived
from subjective considerations of tests and lines of work. The view put forward
above is that his procedure could be made more systematic and scientific if more
factors were experimentally established.
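
Thomson's suggestion of predicting achievement directly by multiple correlation, without the intermediate step of extracting factors, may be sketched thus. The data are simulated, the weights are invented, and ordinary least squares as provided by NumPy is assumed as the method of fitting.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated standardized scores on three tests and a later criterion
# (all values and weights are invented for illustration).
tests = rng.standard_normal((n, 3))
criterion = tests @ np.array([0.5, 0.3, 0.0]) + 0.8 * rng.standard_normal(n)

# Least-squares regression weights, with an intercept column.
X = np.column_stack([np.ones(n), tests])
beta, *_ = np.linalg.lstsq(X, criterion, rcond=None)
predicted = X @ beta

# Multiple correlation: correlation of the predictions with the criterion.
multiple_R = np.corrcoef(predicted, criterion)[0, 1]

# Best single-test correlation, for comparison.
single_r = max(abs(np.corrcoef(tests[:, j], criterion)[0, 1]) for j in range(3))

print(round(multiple_R, 3), round(single_r, 3))
```

Within the sample used for fitting, the multiple correlation cannot fall below the best single-test correlation. How well the weights hold up in a fresh group of candidates is a separate question which this sketch does not address.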

While we have a fairly clear conception of the main factors under- 
lying the abilities of school-children and of unselected adults, the 
picture among older adolescents and adults of superior ability is 
much more complex and obscure. A g-factor can still be extracted
from their mental test results, but it seems to play a decreasingly
important part in educational or vocational achievement, and inter-
ests, work attitudes, and temperament traits become more influential.
An important series of researches by Guilford et al. (1950-4) in America
with high-grade groups such as air-force pilots and university
students suggests that numerous intellectual factors can be dis-
tinguished in the fields of reasoning, planning, judgment, originality,
etc.; and that it would be more profitable to develop tests for these
than for a vaguely conceived general capacity of intelligence. Which 
of them will emerge as consistent and clear-cut, and as educationally 
or vocationally important, among other groups (e.g. British gram- 
mar school pupils and students) is still a matter of doubt. 

Conclusions about Factor Analysis.—A great deal of unavoidably 
complex and expensive research will be needed before we can hope 
to arrive at a reasonably complete map, as it were, of the whole 
realm of human abilities. The difficulties are enhanced because, as 
we have seen, the factor patterns which an investigator extracts are 
often dependent on the particular tests and particular testees he 
employs, and because there is much room for subjective judgment in 
the type of factor pattern he chooses. Research at present is rather 
badly co-ordinated, since each investigator sets out to explore fresh 
territory instead of trying to build on the factors already fairly well 
established by previous workers. Further, there is always the danger 
of mathematical rather than psychological considerations getting 
the upper hand. As already mentioned, factorial investigations are 
not revealing the elements out of which the mind is compounded,
but are classifying test performances with a view to predicting more
effectively other performances, e.g. in the educational or vocational 
sphere. But by no means all of the important facets of human 
nature are susceptible yet of accurate measurement. There may be 
abilities, such as executive capacity, intuitive power, goodness of 
teaching, and social intelligence, which are generally recognized in 
daily life, but which are not yet properly substantiated or analysed
owing to the inadequacies of our tests. Still more is this true in the 
realm of emotional traits and interests, for reasons which the writer 
has described elsewhere (Vernon, 1953). Yet we cannot hope to 
make successful predictions about an individual child or adult 
unless we know every side of his make-up. It may well be that all 
these attempts at analysing the measurable abilities and traits of 
human beings will never give us a picture of an individual personality 
considered as a whole, that we shall always need to supplement test 
results with subjective judgment and intuition in order to resynthesize 
them into a living entity. But this, too, is a matter of psychological
controversy which the reader may follow up elsewhere. 

We have no space to discuss the actual statistical techniques of 
factor analysis. Spearman has fully described his method in The 
Abilities of Man (1927). It is not difficult to apply, though rather 
laborious. It is the most accurate method of extracting unitary- 
factor patterns, but it is hardly suitable for bi- and multiple-factor 
patterns. Holzinger’s (1941) method is an extension which deals 
readily with bi-factor patterns, though it does not seem to possess 
much advantage over the flexible group-factor techniques outlined 
by Burt (1950). The most widely used technique is Thurstone’s 
centroid method, which is well-described by Guilford (1936), and 
is not difficult to acquire. But it has many more elaborate refine- 
ments, for which Thurstone’s book (1947), or Cattell (1952), should 
be consulted. Other more accurate techniques, whose aims and 
guiding principles are somewhat different from those described in 
the present chapter, have been developed by Hotelling, Kelley, and 
Burt. The clearest account of the whole subject is that of Thomson
(1939).


CHAPTER IX 


MENTAL TESTS 


MENTAL testing is simply a more refined and scientific method of 
doing something which we all of us do every day of our lives, that 
is assessing one another's abilities and character traits. Our ordinary 
method is to observe a person's behaviour, including his facial 
expressions and gestures, in various situations and circumstances, 
and his speech or written productions. For example, we might watch 
a small boy playing with bricks, and from his skilfulness, or the 
elaborateness of the resulting construction, jump to the conclusion 
that he is quite a bright child. Obviously the basis for this and other 
such judgments is very haphazard and unreliable. There is no sure 
evidence that the boy is acting more, or less, intelligently than the 
majority of children. Nevertheless this performance may quite well 
be refined and turned into a scientific mental test. So-called per- 
formance tests of intelligence are based on just such activities with 
bricks, blocks, and other concrete material. 

Let us then first consider the essential features of a mental test 
which distinguish it from such everyday judgments and from the 
slightly more scientific school examination. 

1. Standardization of Test Situation.—The first step which the 
tester takes is to standardize the test situation. He will, for example, 
adopt some standard set of bricks and give definite instructions, so
that all children who are subjected to the test shall do it under as 
nearly as possible identical circumstances. Until this is done it is not 
legitimate to compare the performances of different children. With 
every good test there is published a detailed set of instructions, to 
which the tester must closely adhere. School and university ex- 
aminers, on the other hand, usually give no instructions beyond 
stating the number of questions to be answered, and they neglect to 
inform the examinees what type of answers they require. Because of 
this, and because of frequent ambiguities in the wording of questions, 
different examinees may conceive the tasks that are set them in a 
variety of different ways, and it becomes exceedingly difficult to 
compare and mark their answers.

The material of a test, its conditions of application, and its instructions
are further planned to be readily comprehensible to, and
suitable to the mental level of, the children or adults for whom it is 
intended. This is another feature which some school examiners 
might be advised to copy instead of setting subjects for composi- 
tions, or other questions, which are far above the heads of the 
pupils. 

We must, however, admit that several widely used tests do not 
entirely live up to these criteria. Some of them were originally con- 
structed and standardized in America; hence they require consider- 
able modification or translation before they are suitable for British 
children. We are forced to employ these ‘second-hand’ tests because 
the production of fresh tests is an extremely lengthy and expensive 
business. Unfortunately different British testers do the necessary 
‘re-modelling’ in different ways, so that the test situation is no longer 
identical for all. 

2. Standardization of Scoring.—The mental tester does not observe
the child's behaviour, or his finished product, in the casual
manner implied by our illustration, but obtains an accurate record
of them. He insists on knowing definitely what was the test response,
i.e. the child's reaction to the previously laid down test situation.
Either the response is something which can be delimited in quantitative
terms (e.g. so many bricks put into their correct positions in so
many seconds), or else it is something which can be designated
unequivocally as right or wrong.1 It is most important that the personal
opinion of the tester should play no part in this record. Any two 
testers observing and scoring the response should arrive at identical 
records. If this condition holds the test is called an objective one. All
good mental tests, then, are provided with scoring keys or manuals 
of instructions which enable the recording and scoring to be done 
objectively. 
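
The requirement that two testers should arrive at identical records can be checked very simply. The following is a minimal sketch in Python, not a procedure taken from any test manual: the function name and the three sets of right/wrong records are invented for illustration.

```python
def agreement_rate(scores_a, scores_b):
    """Proportion of responses on which two scorers give the same record."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must mark the same responses")
    if not scores_a:
        raise ValueError("no responses to compare")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


# Hypothetical right/wrong (1/0) records for ten items from three testers.
tester_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
tester_2 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
tester_3 = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]

objective = agreement_rate(tester_1, tester_2)   # identical records
subjective = agreement_rate(tester_1, tester_3)  # two items scored differently
```

A test scored with a proper key should give a rate at or very near 1.0 for any pair of testers; a markedly lower figure suggests that personal opinion is entering into the record.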

Our everyday judgments of traits or abilities are, by contrast,
decidedly subjective, for numerous investigations have demonstrated
their liability to vary with the person who makes them. Some
teachers are still convinced that they can assess the intelligence of
their pupils better than can any test. But the fact is that different
teachers, assessing the same pupils, give such different opinions that
the scientific psychologist can put little trust in any of them. The
same is true of many school and other examination marks, as will
appear in Chap. XI.

1 So-called Quality Scales are tests whose scoring constitutes an exception to
this rule, cf. p. 154.

3. Norms.—The layman commonly seems to regard the ob- 
served behaviour as having some absolute value or significance. 
There is no doubt in his mind that it indicates high intelligence, or 
poor sociability, or whatever the trait may be. The more cautious 
psychologist realizes that such judgments must be based largely on 
more or less vague recollections of the behaviour of other similar 
children under similar circumstances, and so insists that all grading 
of abilities is essentially a comparative or relative matter (cf. Chaps. 
II and IV). Thus one of the most vital accompaniments of a mental 
test is the norms, by reference to which any child’s goodness or
poorness of performance may be assessed. Unfortunately it is also 
the feature in which many published tests are defective, on account 
of the difficulties already described (pp. 66-73). 
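
How norms turn a raw score into a comparative statement may be sketched as follows. This is an illustration only: the norm sample of ten scores is invented and far too small for real use, the mid-rank convention for ties is one of several in use, and the scale of mean 100 and standard deviation 15 is an assumption made for the example.

```python
from statistics import mean, pstdev


def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below `raw` (ties count half)."""
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def standard_score(raw, norm_scores, new_mean=100.0, new_sd=15.0):
    """Re-express `raw` on a scale with the chosen mean and S.D."""
    m = mean(norm_scores)
    sd = pstdev(norm_scores)  # S.D. of the norm group itself
    return new_mean + new_sd * (raw - m) / sd


# Hypothetical raw scores of a norm group of the same age.
norm_sample = [12, 15, 18, 20, 20, 22, 25, 27, 30, 31]

rank_of_20 = percentile_rank(20, norm_sample)
score_of_22 = standard_score(22, norm_sample)
```

The same raw score of 20 would receive a quite different percentile in another norm group, which is the sense in which all grading of abilities is relative.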

In the nature of examining it is hardly possible to establish norms 
for examinations, since they cannot be applied to large numbers of 
pupils or students beforehand for this purpose, as is the standardized 
test. We have seen, however, that the process of rationalizing marks, 
which is essentially equivalent to the setting up of provisional 
norms, is necessary if the present ambiguities in the significance of 
school or examination marks are to be avoided.

4. Reliability and Validity.—A single piece of behaviour,
casually observed, as in our illustration, would probably be very low 
in reliability. A discussion of the statistical methods of determining 
reliability was given above (p. 126), and the reliability of several of 
the more important tests is mentioned later in this chapter. 

The most serious defect in the layman’s everyday judgments is that 
he seldom has any good proof that the observed behaviour validly 
indicates the ability which he deduces from it. It may have arisen 
from some quite different trait or ability, or else from an ability 
which is almost wholly specific (in Spearman's sense), i.e. not corre- 
lated with anything else. To be reliable a test need only measure 
accurately the ability to perform that test, but to be valid it must also 
correlate effectively with other measures of whatever it is supposed 
to test. The mere inspection of the content of a test is seldom sufficient 
for the determination of its validity, although it is a necessary 
preliminary. 
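
The distinction can be put in terms of two correlations. In the sketch below, which uses invented scores for six testees, reliability is taken as the correlation between two sittings of the same test, and validity as the correlation of the test with some independent criterion. The variable names and data are assumptions for illustration, not results from any actual test.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of six testees.
first_sitting = [10, 12, 15, 18, 20, 23]
second_sitting = [11, 12, 16, 17, 21, 22]
criterion = [3, 9, 4, 8, 5, 7]  # independent measure of the supposed ability

reliability = pearson_r(first_sitting, second_sitting)
validity = pearson_r(first_sitting, criterion)
```

With these figures the test agrees closely with itself but only weakly with the criterion: it measures something accurately, though apparently not the ability it was supposed to measure.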

Such inspection of educational tests and examinations may reveal
their failure to cover the ground which the pupils should have
accomplished. Thus school and university examinations are often defective
because their questions are too few in number, or are badly set.
It will be obvious also to a tester that the content of some of the pub- 
lished educational tests is unsatisfactory, either because they are 
designed for pupils whose tuition differs widely from that current 
among his testees, or because they are out-of-date. But apart from 
such flaws, many tests involve group or specific factors which detract 
from their effectiveness as measures of some general ability. For 
example, a test of reading may be based on the capacity for pro- 
nouncing difficult words. This may or may not indicate children's 
capacities in other aspects of reading, such as their fluency, or their 
comprehension of what they have read. Investigation by means of 
correlations is essential for establishing these points. Again, the 
technique of the test, i.e. the way in which the testee is required to 
express his ability, always seems to introduce an irrelevant factor. 
In Chap. XI we shall show that in an ordinary essay-type of exam- 
ination, the translation of the examinees’ knowledge into essay-form 
is one such factor, and in Chap. XII the examiner's translation of his 
questions into new-type or objective form will be recognized as an- 
other. Temperamental factors, health, fatigue, and the like also 
influence a testee's performances at many tests, and so reduce their 
validity as measures of ability. 

Most educational and vocational tests are assumed not only to 
be valid indicators of some present general ability, but also to be 
prognostic of future achievement. Their predictions should then, of 
course, be followed up and compared with subsequently obtained 
measures of achievement. In the educational field there have been 
many such investigations whose results usually show our tests and
examinations to possess disappointingly poor validity (cf. Chap. XI). 
Vocational psychologists have also tried to correlate their predictive 
tests with some measure of output, or with estimates of efficiency 
obtained from supervisors or foremen. As shown elsewhere (Vernon 
and Parry, 1949), however, it is not always easy to secure a satisfac- 
tory criterion of success or failure in an occupation against which 
the test results can be checked. 

The validation of intelligence tests, or of tests of capacities such as
practical ability, verbal ability, and the like, is particularly difficult,
since we possess no objective criterion of intelligence, etc., with
which to compare them. Ratings have sometimes been used, that is,
teachers’ estimates of the intelligence of children in their charge, but
these are highly fallible. If psychologists are unable to agree as to
what they mean by intelligence, it is unlikely that laymen's views will
be any more concordant. Indeed, if teachers could judge it accurately,
there would be no need for tests. The main defect of such ratings is 
that they are strongly biased (often quite unwittingly) by considera- 
tions such as the children's industriousness and good school be- 
haviour. Parents’ judgments are still less trustworthy. Several group 
intelligence tests have been validated by comparing their results with 
the results of one of the Binet scales. But the validity of these scales 
is open to doubt. The items which they contain have always been 
chosen so as to differentiate older (hence presumably more intelligent) 
children from younger (less intelligent) ones. This criterion is, how- 
ever, dubious in value, since it would admit tests of, say, height and 
weight which have very little validity as indicators of intelligence. Yet 
we do not rely merely on subjective judgment when we state that these 
must be excluded from a test of intelligence, for in the absence of 
any external criterion of validity, we make use of what has been called 
the method of internal consistency. 

This method consists either of factorial analysis, or of a modifica- 
tion thereof. All the test items or sub-tests are selected in the first 
place on the basis of subjective opinion; they must appear to the 
tester to involve the exercise of intellect. Thereafter they are studied 
empirically. Either each item is compared with every other, or with 
the sum of all the rest of the items, or sets of items such as the sub- 
tests in a group intelligence test are compared with other sets. The 
inter-correlations will serve to show whether or not all the items are 
consistent with one another—whether they are measuring one and 
the same thing. And items that fail to conform to this criterion are 
excluded from the final, published, version of the test. The statistical 
conditions imposed by Spearman in his factorial analysis of group 
intelligence tests are much more rigorous than those used by Terman 
in revising the Binet scale. Both, however, prove that there is a gen- 
eral factor running through all their test items or sub-tests. Spearman 
and his followers wish to measure nothing but g, that is, they aim at 
a uni-factor pattern, and so exclude a number of sub-tests which fail 
to conform to such a pattern. The Binet test, on the other hand, al- 
though it certainly embodies a g-factor, contains a hotch-potch of 
heterogeneous tasks which would, if fully analysed, yield several 
group factors (a bi-factor pattern), or even a set of multiple factors 
(cf. Burt, 1939b). The effects of these different techniques of test
construction will be discussed more fully below. What matters for
our present purpose is that when a test, or a set of tests, is internally 
consistent, this signifies that it will validly predict any other be- 
haviour which involves the same general factor. Intelligence tests 
work fairly effectively because scholastic success, and many of the 
other intellectual activities of everyday life, are known to be depen- 
dent largely on the g which they measure. 
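
The simplest form of the internal consistency check, comparing each item with the sum of all the rest, may be sketched as follows. This is a much cruder procedure than Spearman's factorial analysis and is offered only as an illustration: the response table, the function names, and the cut-off of 0.2 are all assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses):
    """Correlate each item with the total of all the other items."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        result.append(pearson_r(item, rest))
    return result


def consistent_items(responses, threshold=0.2):
    """Indices of items whose item-rest correlation reaches the threshold."""
    correlations = item_rest_correlations(responses)
    return [i for i, r in enumerate(correlations) if r >= threshold]


# Hypothetical right/wrong records: one row per testee, one column per item.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]

kept = consistent_items(responses)
```

In this invented table the fourth item (index 3) correlates slightly negatively with the rest and would be excluded, while the first three hang together and are retained.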

Test Construction.—Before proceeding, we would strongly urge 
amateur testers and teachers not to attempt to make up their own 
tests, apart from new-type ones for school examination purposes. To 
devise test items similar to those which appear in published tests is 
not difficult, but a great many technical features enter into the selec- 
tion of suitable items and into their standardization, in addition to 
those described here. The points which we have outlined are meant, 
not to show how to construct a test, but to enable amateurs to choose 
more wisely the tests which will be most suitable for their purposes, 
and to give them some assistance in understanding what they are
doing when they apply published tests.


TYPES OF TESTS 


At the end of this chapter is given a classified list of most of the 
mental tests available in this country. The main headings are: 


I. Attainment, Achievement, or Educational Tests.
A. Examinations.
B. Standardized Educational Tests.
1. Individual reading tests.
2. Language development and Vocabulary.
3. Individual oral arithmetic.
4. Group tests.
II. Individual Intelligence Tests.
A. Versions of the Binet-Simon scale.
B. Wechsler scales.
C. Miscellaneous scales.
D-E. Performance Tests.
III. Group Intelligence Tests.
A. Oral.
B. Non-verbal.
C. Verbal.
IV. Aptitude and Special Ability Tests.
V. Personality Tests and Assessments.


The last two sections merely mention some of the tests commonly 
applied in this country, but the first three aim to be comprehensive. 
Although we shall not attempt to describe particular tests, the follow- 
ing general explanations of, and comments on, the main types may be 
useful to those who are not already thoroughly familiar with the field. 

Examinations and Educational Tests.—Examinations are still the 
most important method of educational measurement, and so come 
at the head of our list. Their defects, and the possibilities of improv- 
ing them, or of employing substitutes, are discussed in Chaps. XI- 
XIV. Standardized educational tests must be contrasted with them, 
for although they are made up of the same kinds of items or questions 
as occur in objective or new-type examinations, their functions are 
quite different. They are not designed to cover the work done by 
particular classes of pupils or students. Nor are they meant, pri- 
marily, for selecting pupils who are most likely to do well in some 
future work, though they are nowadays quite commonly used for 
assessing attainments at the 11+ selection stage. Rather they serve 
to show what is the standing of a class, or of an individual pupil, 
relative to other similar pupils, either on some general subject such 
as English or arithmetic, or on special branches of, or stages within, 
a subject (these latter being known as diagnostic,1 or analytic, tests).
The fact that they are published, and their expense, obviously make 
them unsuitable for use as school examinations. By means of them, 
however, a teacher can usefully grade his pupils at the beginning of 
the school year, so as to find which ones will need most coaching, or 
in which parts of a subject they are most advanced or backward. If 
the norms are adequate (cf. Chap. IV), they will tell him how his 
pupils compare with pupils in general of the same age. The psycholo- 
gist at a Child Guidance Clinic, or the vocational adviser, finds them 
especially valuable for assessing a child's strong and weak points. 

1 Cf. the excellent discussion of diagnostic tests by Schonell (1950); also
Whilde (1955).

Quality Scales.—Although most educational tests consist of large
numbers of short items, both so as to cover the subject as thoroughly
as possible, and so as to admit of objective scoring (cf. Chap. XII),
some educational products which are too complex to be assessed in
this piecemeal fashion can nevertheless be measured by ‘quality’ or
product scales. Such standardized scales are available for grading
handwriting, composition, drawing, etc. Each of these consists of a
series of specimen products which have been carefully chosen as
representative of certain levels of achievement. The testee is told to
write or draw the same matter in his normal manner, and his product
is, as it were, slid along the scale, or compared with the specimens,
until one is found to which it corresponds in achievement. It is then
given the score of this specimen. Such grading cannot be made
wholly objective, but there is evidence that it improves with practice.
Different testers will, when experienced, award the same or nearly
the same score to a child's product. Certainly the process is fairer
than grading without any external standards. Hence the adoption of a
similar plan in the marking of school work was advocated in Chap. II.

Burt's scales all consist of specimens selected from a large number
as being typical of the work of average 5+, 6+, ... 14+ children.
Schonell's composition scales, covering the 7- to 14-year range,
are similar. Cattell's handwriting scale for school-leavers and his 
drawing scale for adults are merely graded I to V, representing 
different degrees of merit at these age levels. 

Quantitative Tests and Scales.—The scoring of quantitative tests
either by the ‘rate’ or ‘power’ systems, or by a mixture of these, has
already been described in Chap. II. It will be noted that scarcely any
tests are available much above the 14-year level, and that very few
deal with subjects outside the 3 R’s. One obvious reason for this is
that secondary school curricula vary so widely, and that only relatively
small numbers study a foreign language or a science, etc., by
comparable methods at comparable levels. Nevertheless it would not
be difficult to devise and standardize tests in a number of subjects
suitable for some fairly homogeneous groups of pupils or students,
such as those in modern or in grammar schools, in teacher training
colleges, or provincial universities. Certainly our tests lag far behind
those available for similar purposes in America.


INTELLIGENCE TESTS


Intelligence tests closely resemble quantitative educational tests
(to which they are, of course, historically prior) in their construction
and scoring. Thus they may consist of:

(a) A series of tasks graded in difficulty (power plan).
(b) Tasks such as performance tests where the speed or number
of moves are recorded (rate plan).
(c) Sets of questions which are roughly graded, but where the
score depends on the number answered within a certain time limit.

Group tests generally include a ‘battery’ of some half-dozen (be- 
tween about four and twelve) ‘sub-tests,’ each containing some twenty 
(between ten and fifty) questions or items. Alternatively they are 
arranged in ‘omnibus’ form, where items of all kinds are mixed up.
In the battery type, each sub-test has its own instructions and sample 
items, and is timed separately. In the omnibus form, the instructions 
and samples may be printed along with the questions, or included 
in a preliminary practice sheet, and one time limit suffices for the 
whole test. An omnibus test is therefore somewhat easier to give, but 
a test battery has the advantage of providing rest pauses every few 
minutes. 

Almost all intelligence tests thus consist of rather large numbers 
of items. Performance tests are exceptional, but it is desirable to 
apply several, say six or more, of these. The tests are always applied 
and scored in a standard manner, and the scores either take the 
form of Mental Age years (as in the Binet and some other tests), or 
percentiles or standard score I.Q.s (cf. p. 67). 
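
The two commonest ways of expressing an I.Q. may be sketched as follows. The ratio form (Mental Age divided by Chronological Age, times 100) is the familiar Binet-type quotient; the standard score form re-expresses a raw score against the mean and standard deviation of the testee's own age group. The figure of 15 points for the standard deviation, and the numerical examples, are assumptions made for illustration.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Binet-type quotient: 100 x Mental Age / Chronological Age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


def standard_score_iq(raw, age_group_mean, age_group_sd, sd_points=15.0):
    """I.Q. as a standard score relative to the testee's own age group."""
    if age_group_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 100.0 + sd_points * (raw - age_group_mean) / age_group_sd


# A child of 8 years (96 months) with a Mental Age of 10 years (120 months).
binet_style = ratio_iq(120, 96)

# A raw score of 58 where the age group averages 50 with an S.D. of 8.
standard_style = standard_score_iq(58, age_group_mean=50, age_group_sd=8)
```

The first gives an I.Q. of 125 and the second an I.Q. of 115; the two forms agree only in so far as the spread of Mental Ages happens to correspond to the chosen standard deviation at the age in question.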

The Innateness of Intelligence.—Many of the standard textbooks 
describe intelligence as a general innate capacity underlying all our 
abilities, dependent on the genes that we inherit, and therefore fairly 
constant throughout life. Thus intelligence tests differ, it is said, 
from attainments tests in that they consist of tasks whose solution 
depends upon native ability rather than upon acquired experience 
or upbringing. This view, however, seems rather unrealistic. For 
although there is ample evidence of the existence of inherited 
differences in intellectual potentiality, we are quite unable to observe 
or measure such a capacity. The good or poor intelligence which we 
observe in everyday life, at school or at work, and which we try to 
measure by our tests, is the product of innate factors and environ- 
ment. As Hebb (1948) and Piaget (1950) have shown, the young 
child gradually builds up or acquires his perceptions and habits and 
his thinking through contacts with his physical and social environ- 
ment; and the level that he reaches at any age is affected considerably 
by the kind and amount of intellectual stimulation that the environ- 
ment has provided up to that age. This level is fairly stable under 
normal conditions of upbringing—more so than would be gathered 
from the publications of some left-wing theorists and of some 
extremely environmentalistic American investigators. But it does
fluctuate more than would be expected from the statements of some
of the early pioneers of mental testing. For example, over the primary
school age range from 6 to 11, or over the range from 11 to 18
or so, the correlation between two similar tests is very far from perfect,
but it does not sink below +0.70. According to the Probable
Error of estimate technique (p. 127), this means that the average or 
median child alters by about 7 I.Q. points up or down on retesting. 
Many are more stable than this, but a very few show much larger 
gains or losses of 30 or more points (4 x P.E.); and 17% are liable 
to alter 15 or more points either way.1 At the same time the stability
is great enough to allow us to make useful educational or vocational 
predictions, for the majority of children, over several years. Par- 
ticularly striking evidence of this fact is provided by Terman and 
Oden's (1947) follow-up of children with very high I.Q.s over 25 
years. A fuller discussion of these controversial questions, and of 
the psychological nature of intelligence, may be found elsewhere 
(Vernon, 1960). 
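
The arithmetic behind these figures can be reproduced approximately. The sketch below assumes the usual formula for the Probable Error of estimate, 0.6745 x S.D. x sqrt(1 - r²), an I.Q. standard deviation of 15, a correlation of 0.70, and normally distributed errors. The formula on p. 127 is not reproduced here, so its agreement with this one is an assumption.

```python
from math import sqrt
from statistics import NormalDist

PE_FACTOR = 0.6745  # one probable error expressed in standard deviations


def probable_error_of_estimate(sd, r):
    """Median size of the discrepancy between predicted and actual score."""
    return PE_FACTOR * sd * sqrt(1.0 - r * r)


def share_changing_by(points, sd, r):
    """Proportion expected to differ by `points` or more, in either direction."""
    standard_error = sd * sqrt(1.0 - r * r)
    return 2.0 * (1.0 - NormalDist().cdf(points / standard_error))


pe = probable_error_of_estimate(sd=15.0, r=0.70)      # about 7 I.Q. points
four_pe = 4.0 * pe                                    # just under 30 points
share_15 = share_changing_by(15.0, sd=15.0, r=0.70)   # roughly one in six
```

On these assumptions the Probable Error comes to a little over 7 points, four times that to nearly 30, and the proportion altering by 15 points or more to about one child in six, which is of the same order as the figures quoted above.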

Thus a better way of looking at intelligence and attainments is to 
say that the former refers to the more general qualities of thinking— 
comprehension, level of concept development, reasoning and grasp- 
ing relations—qualities which seem to be acquired largely in the 
course of normal development without specific tuition; whereas the 
latter refer more to knowledge and skills which are directly trained, 
and whose acquisition and retention depend more on the person's 
interests and on his personality traits such as industriousness. There 
is no sharp dividing line between them. Indeed some items such as 
vocabulary are to be found either in intelligence or in English attain- 
ment tests. Again some of the items in the Binet and Wechsler 
scales, and in such well-known group tests as Army Alpha, are 
based on general knowledge or information. Although such knowledge
is obviously acquired, these work very well as measures of
intelligence, since the more intelligent are more likely to pick it up,
in or out of school, and to retain it.2 I.Q. and E.Q. always show very
close overlapping, though some children do come out markedly
better on one than the other; and investigations such as Burt's (1937)
and Schonell's (1942) into educational backwardness reveal that
those with relatively low attainment are usually handicapped by
bad home conditions, poor health, ineffective teaching, or emotional
maladjustment, or a combination of these. Thus even though we must
admit that intelligence and the I.Q. are only partly innately determined,
it is still true that they give better indications of capacity for
acquiring further education than do present attainments alone.
Intelligence tests are more fair to children from remote rural
schools, and others whose schooling has been affected, e.g. by
absences, and to children of inferior background; although all these
factors themselves do have adverse effects on the I.Q.

1 With repeated retestings variations are greater, and they may be increased
by numerous factors such as above average ability, standard deviations greater
than 15, inaccurate norms, or practice and coaching effects. They are much
larger, also, if the first test is given before the age of 6. Indeed so-called
developmental quotients obtained from tests of 0 to 2½-year-olds have virtually no
predictive value for later I.Q.

2 A more serious criticism is that girls and women tend to be considerably
poorer in general knowledge than boys and men; and it is preferable to use items
for intelligence tests which show no large sex differences.

The majority of tests employ the verbal medium since man's 
higher thinking and reasoning capacities generally involve the use 
of words, numbers, or other symbols. Moreover one of the most
important, if not the most important, use of intelligence tests is that 
of predicting scholastic aptitude, and as school work is itself pre- 
dominantly verbal, verbal tests are certainly the most useful. But 
this means that people's performance at such tests is considerably 
affected if they have had inadequate opportunities for acquiring 
language, for example the foreign-born or bilingual, the deaf, or 
children of gipsies or canal bargees who receive little or no schooling. 
Many testers therefore prefer non-verbal tests, consisting of prob- 
lems based on pictures, abstract diagrams or concrete (performance) 
material. While these may be more fair to the verbally handicapped, 
they cannot be considered as giving a truer measure of innate 
ability. A British or American child, for example, has much more 
opportunity than an African child for acquiring skills with pictures 
and blocks, and a middle-class British child probably has advantages 
over a lower working-class child. Nadel (1939) and Biesheuvel (1949)
show convincingly the fallacies of assuming that racial differences in 
intelligence can be measured by such tests, or indeed that any type 
of test is ‘culture free.’ Tests of the concrete or pictorial type are 
valuable not only for the deaf and the illiterate, but also for younger 
children who are less interested in verbal tests and less accustomed 
to verbal tasks than older persons. But these, and other non-verbal, 
tests give decidedly poorer predictions of educational capacity, 
except possibly in the technical, scientific, and mathematical fields. 

Individual Tests.—Up to 1937, the majority of testers of individual
children used the Stanford Revision of the Binet-Simon scale, published
by Terman in 1916, or Burt's unpublished restandardization
of this Revision. The New Stanford Revision, or Terman-Merrill 
scale then became the most popular version (Terman and Merrill, 
1937). This is superior both because it contains very full instructions 
for application and scoring, which help to obviate the confusions 
and errors previously so common among inexperienced testers, and 
because of its greater extensiveness. It is issued in two forms, L and 
M, both of which cover almost the whole range of intelligence from 
that of 14-year children to that of adults. At the same time it has its 
difficulties and weaknesses: 

(a) The original American versions cannot be applied as they 
stand in this country, because they are replete with words and phrases 
that require ‘translation’ into English. Some of these have been 
altered in the edition published here, but many other desirable 
modifications have been suggested. Of the latter, the simplest and 
the most accessible are those prepared by the Scottish Council for 
Research in Education (Kennedy-Fraser, 1945). Again the typical 
responses quoted in the scoring instructions are of little use since, 
even after translation, they represent the thoughts and expressions 
of American children. The tester must try to decide whether his 
testees’ responses correspond in intellectual level to these accept- 
able and unacceptable responses. 

(b) None of the British versions has been restandardized. Thus 
the tester must expect to find that many of the tests are misplaced in 
order of difficulty. Children may fail all the tests in one year, and 
yet pass one or two in a higher year. The same applies to items within 
a test. Thus the Vocabulary words and some of the Absurdities,
etc., require rearrangement. 

(c) It is difficult to obtain toys and other practical material, 
needed for tests of younger children, which are precisely comparable 
to the originals, and this material is very expensive. Fortunately it is
used only at Mental Ages of 4 and less. For almost all children of 
primary school age, the printed cards and beads and blocks are 
sufficient. 

(d) Most of the tests near the top end of the scale, labelled Aver- 
age or Superior Adult (M.A.s 15-22), are too childish in content to 
be suitable for adults, although they are useful for discriminating 
among children of above average and very superior intelligence. 

(e) Despite the care taken over the American standardization, the
tests from about 11 years onwards appear to be considerably too
easy, and testers complain that far too many adolescents are getting 
high I.Q.s in the 130-160 range. It may be that these queer results 
arise merely from variations in the Standard Deviation of I.Q.s at 
different age levels (cf. p. 71); and Fraser Roberts and Mellone 
(1952) supply a useful correction table which allows for such varia- 
tions. But, as pointed out earlier (p. 73), no test based on M.A. 
scoring can be expected to yield really satisfactory results. 
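
The kind of adjustment that allows for a varying Standard Deviation may be illustrated in principle. The sketch below simply rescales an I.Q. obtained at an age where the spread is too wide to the spread it would have with a standard deviation of 15. It does not reproduce the table of Fraser Roberts and Mellone, which should be consulted in practice, and the standard deviation of 20 used in the example is invented.

```python
def rescale_iq(iq, sd_at_age, target_sd=15.0, mean=100.0):
    """Shrink or stretch the deviation from 100 to a common S.D."""
    if sd_at_age <= 0:
        raise ValueError("standard deviation must be positive")
    return mean + (iq - mean) * target_sd / sd_at_age


# Hypothetical case: an adolescent scores 140 at an age where the
# S.D. of I.Q.s is assumed (for illustration only) to be 20.
corrected = rescale_iq(140.0, sd_at_age=20.0)
```

Under that assumption an apparent I.Q. of 140 corresponds to 130 on a scale with a standard deviation of 15, which shows how an inflated spread alone could produce an excess of very high I.Q.s.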

The Wechsler intelligence scales surmount several of these 
difficulties. The original Wechsler-Bellevue test (Wechsler, 1939) is 
now replaced by W.A.I.S. (Adult Intelligence Scale), which has
norms from 17 to 75 years. The W.I.S.C. (Intelligence Scale for
Children) is applicable from 5 to 15+ years, but is not—like
Terman-Merrill—suitable for pre-school children. Each scale in- 
cludes 5 or 6 verbal sub-tests and 5 performance tests, and thus yields 
a verbal and a performance, as well as a combined, I.Q. Here too 
some adaptations are needed for British testees, but standard modi- 
fications have been prepared by the British Psychological Society. 
The pictures, blocks, etc., needed for the performance tests are un- 
avoidably expensive, but the standardization and scoring are so much 
superior to those of any other scale that most psychologists con- 
cerned with individual testing of children and, especially, of adults 
prefer the Wechsler. 

Valentine's Intelligence Tests for Young Children (1948) are 
particularly suitable for 2- to 5-year-olds, though also providing a
short battery of individual tests for children up to about 11 years. 
This scale has the advantage that all the necessary materials are 
included in the handbook, or can easily be constructed by the tester. 
For 0-2-year-olds, Griffiths’ (1954) Mental Development Scale is the
most thorough, and the best standardized in Britain. It yields 
separate quotients in each of five fields of development—Locomotor, 
Personal-social, Hearing-speech, Eye and Hand, Performance, and 
a combined or general quotient. In testing at this level it is especially 
important that the tester be thoroughly trained and experienced in 
managing babies and using the correct materials and instructions. 

1 The original version may, however, continue to be used for some time in this 
country, partly because more thorough English adaptations have been worked 
out, and partly because it normally takes only 45 minutes to apply as compared 
with over 60 for W.A.I.S. 

Performance Tests.—As we have seen, these practical tests may be 
used when verbal ones are inapplicable, or they may usefully supplement 
the predominantly verbal Binet-type test. Their chief merits are 
their greater attractiveness to children, and their capacity for bring- 
ing out a variety of temperamental reactions. A testee's manner of 
approach to such tests, and his behaviour when confronted with 
difficulties, throws much light on his impulsiveness, persistence, 
complacency, and other qualities, which cannot as yet readily be 
tested or measured objectively. 

The majority of performance tests suffer from the practical draw- 
backs that they are very unwieldy and difficult to transport, and 
extremely costly. Some testers are suspicious of them because 
different tests often give very inconsistent results. A child with a 
Binet M.A. of 9 years may, for instance, pass performance tests at 
anywhere from the 7- to the 12-year levels. This is partly due to poor 
standardization, and to variations in the actual materials and 
methods of application. As they have to be applied individually, the 
task of providing accurate British age norms is a difficult one (cf. 
Vernon, 1937). Even if they were perfectly standardized, we should 
still expect considerable inconsistencies, since their inter-correlations 
show that each test involves a large specific factor, or unknown 
group factors. Many of them are also poor in reliability, and the 
evaluation of this feature is difficult because of the large practice 
effects if they are applied twice to the same testees. Thus the mental 
tester should never place much reliance on a single performance test, 
nor on two or three. A small number may be sufficient to indicate 
whether a Binet I.Q. is likely to have been distorted by some verbal 
defect, but the median or mean score from at least half a dozen, or 
the total result from a battery such as Drever-Collins, W.A.I.S. or 
W.I.S.C., should be taken if a reliable performance test M.A. or I.Q. 
is desired. 
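
The advice to rely on the median of at least half a dozen performance tests can be sketched in a few lines of Python. This is an illustration only: the function name, the child's age and the six Mental Ages are hypothetical, and the I.Q. is taken as the classical ratio 100 × M.A. / C.A.

```python
from statistics import median


def performance_iq(test_mental_ages, chronological_age):
    """Return (median M.A., ratio I.Q.) from several performance tests.

    All ages are in years. The I.Q. is the ratio 100 * M.A. / C.A.
    """
    if len(test_mental_ages) < 6:
        # The text advises at least half a dozen tests for a reliable figure.
        raise ValueError("at least half a dozen tests are advised")
    ma = median(test_mental_ages)
    return ma, round(100 * ma / chronological_age)


# Hypothetical child aged 9.0 whose six results scatter from 7 to 12 years.
ma, iq = performance_iq([7.0, 8.5, 9.0, 9.5, 10.0, 12.0], 9.0)
print(ma, iq)
```

With these invented figures any single test could place the child anywhere from the 7- to the 12-year level, whereas the median settles at an M.A. of 9.25 years.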

Undoubtedly then their validity, either as measures of g or of 
‘practical ability,’ is poor. True, we have seen that they may involve 
the spatial ability factor, k, to some extent, and so be useful in 
selection for mechanical occupations and technical courses. But 
there is no real justification for the view that they give a better 
indication of intelligence for daily life purposes than the more 
reliable verbal type of test. It is interesting to note that several of 
them are markedly affected by emotional maladjustment and neuro- 
ticism. Children tested at a Psychological Clinic tend to make poorer 
scores than on Binet. Delinquents, on the other hand, who are often 
very backward educationally, do tend to do comparatively well on 
performance tests (cf. Earl, 1939). 


GROUP TESTS AND INDIVIDUAL TESTS 

The merits and defects both of individual tests such as Terman- 
Merrill, Wechsler and performance tests, and of group verbal or 
non-verbal tests, may best be brought out by contrasting them under 
the following series of headings: 

1. Age Range.—Individual tests must be used with very young 
children because it is practically impossible to hold the attention of 
a group, or to get them all to act alike at the same moment. Further, 
it is not natural for children to apply their intelligence to symbols on 
paper—either words, pictures or diagrams—until they have settled 
down to formal schooling. Manipulation of concrete objects or 
individual conversation with the tester constitute much more appro- 
priate testing media. Group tests are available which claim to 
measure Mental Ages down to 6 or even 5 years. The instructions for 
these must, of course, be given orally. But their value is very doubtful, 
for the above reasons. They may sometimes be useful for classifying 
pupils aged 7+, i.e. children whose M.A.s are likely to run from 
5 to 9. 

By 9 or, better, 10 years, children can read instructions for them- 
selves, are thoroughly accustomed to class work, and can readily 
deal with a variety of verbal problems. This is therefore a very suit- 
able age at which to begin group testing. Above 14, group tests have 
definite advantages over individual, both because they can readily 
be made difficult enough even for the highest intelligence levels, and 
because their items are much less apt to appear silly and childish to 
suspicious adolescents or adults. Nevertheless, the W.A.I.S. or 
W.I.S.C. are preferable for abnormal cases who would not settle 
down readily to the group test situation, e.g. the feeble-minded or 
patients in a mental hospital or clinic. 

2. Ease of Application.—Group tests are easier to procure, to 
apply and to score, and they demand less training and experience on 
the part of the tester. Possibly, however, these are disadvantages. 
We assume too readily that group test performance is unaffected by 
the tester and the atmosphere he creates. As shown in the next 
chapter, there are many errors of application and interpretation that 
inexperienced testers often commit. 

3. Time.—Needless to say, the time saved by applying group 
instead of individual tests is enormous. For instance, a class of forty 
can usually be tested in less than an hour, and their answers scored 
in four to eight hours, whereas an experienced Binet tester needs 
about half an hour for each young child, three-quarters for an older 
one, and an inexperienced tester much longer still. The point, how- 
ever, is whether this period per child is worthwhile. The majority of 
psychologists would certainly say that it is, and that an individual 
test can reveal much better than a group test things which the 
child's teachers and parents may have failed to recognize even after 
years of acquaintanceship. Probably the best compromise would be 
for a school psychologist or infants’ mistress to test each child 
individually at 6, for a group non-verbal or oral test to be given in 
class at 8, and verbal group tests at 10 and 14 (assuming that the 
results of selection tests are available at 11+). Such results would be 
entered on a cumulative record card, and would provide a very 
reliable picture of each child's intellectual development. Individual 
tests would need to be given from 7 onwards only to exceptional 
pupils whose group test results were highly irregular, or about whom 
some important decision had to be made for purposes of educational 
or vocational guidance. 

4. Time Limits.—Most of the items in the Binet scale and many 
individual performance or educational tests are untimed, whereas 
the great majority of group tests have to be done under speeded 
conditions. Are not, the critics ask, some children slower but surer 
than those who score well on these timed tests? As indicated pre- 
viously (p. 132), the answer is more negative than positive, but the 
question is a complex one. If time allowances on speeded tests are 
increased, slower children certainly improve their scores, but they 
still generally remain poorer than quick ones. The rank order is but 
little affected and (according to the principles of Chaps. II and IV) 
it is relative, not absolute, scores that matter. At the same time experiments 
do show that with highly speeded and rather easy tests, 
children’s performance is considerably affected by their quickness 
and by whether they aim primarily at speed or at accuracy; whereas 
with very difficult, untimed material their persistence or willingness 
to go on trying may affect their results as much as does their ability. 
For most purposes it is desirable to compromise between these 
extremes, and to avoid mixing up our measures of ability with what 
are essentially temperamental or work attitude factors. Further, it 
would be administratively most inconvenient to use untimed tests, 
which different testees would take very different times to complete. 
Hence it is possible that some of our present group intelligence and 
attainments tests (particularly in arithmetic) are unduly speeded. If 
we inclined towards tests with longer time limits and more difficult 
items, very few children would alter their relative scores appreciably, 
but it is likely that our predictions of future educational achieve- 
ment would be slightly improved. We know, too, that emphasis on 
speed unduly handicaps older persons, say 30 years upwards, hence 
ample time limits are preferable among adults. It may be that it also 
upsets emotionally excitable or nervous children, and this is one 
reason why Clinic psychologists prefer to use individual tests where 
the testee gives his answer at his own pace. 

5. Objectivity.—With good group tests the conditions of 
application and the scoring are much better standardized than with 
most individual tests. Very little is left to the personal decision of the 
tester, and the marking is, or should be, completely foolproof. In 
actual practice many group testers still manage to commit gross 
errors, some of which will be pointed out in the next chapter. And 
modern individual tests like the Terman-Merrill and Wechsler afford 
much less scope for subjective whims than did the older ones. 
Nevertheless individual testers often have to judge the correctness of 
doubtful responses, and may, without departing from the instructions 
in the handbook, alter the testing conditions in a variety of ways. Yet 
there seems to be no evidence of serious discrepancies between the 
results of trained testers, although untrained ones may make grave 
mistakes, even sometimes leading to the wrongful certification of 
mentally defective children or adults. Reliability coefficients over 
short periods are very high, though greater difficulties leading to 
lowered reliability occur in testing children below the age of 5, and 
among maladjusted children or adults (cf. Tulchin, 1934). The ex- 
perienced tester should, however, be able to recognize when his 
results are liable to be affected by extreme unco-operativeness or 
instability in the testee, and to treat them with due caution or to 
arrange for a re-test when desirable. 

Actually the rigid objectivity of the group test is distrusted by 
Clinic psychologists, since they know that the same physical situa- 
tion may mean very different things to different children. The 
flexibility of individual testing is to them a great advantage which, so 
far as is known, they do not abuse. 

6. Dependence on Health and Mood.—Both group and individual 
tests seem to be much less affected by these factors than might be 
anticipated. Experiments show little and often no difference between 
the scores of children when they are fatigued or ill, and when in good 
health, nor between those who are strongly motivated (e.g. by the 
promise of monetary rewards) and those who are merely encouraged 
in the ordinary way. Studies of the latter type sometimes show that 
testees who are making more effort may attempt more test items, but 
they also make more mistakes so that their scores are scarcely 
altered. There seems then to be little evidence for such statements as 
that the examination conditions of group testing make some testees 
unduly nervous, or that others are stimulated to do better by the 
competitive nature of the testing situation. All such conclusions, 
however, are based on average tendencies, which admit occasional 
individual exceptions, and Binet testing certainly indicates that a few 
testees are so resistant or timid that they fail to do themselves justice. 
More extreme differences in attitude to the test can undoubtedly 
distort the scores, and certain physical conditions such as bad eye- 
sight must have a like effect. No intelligence test possesses perfect 
reliability; and conditions such as these may be at least partly 
responsible. Variations in group test performance, as well as in 
individual, are known to be greater among maladjusted children. Yet 
at the same time there is no evidence that excitement or anxiety 
upsets the results of 11+ selection tests among more than a small 
minority. 

The great advantage of individual over group testing is that in the 
former the tester can generally discern the influence of abnormal 
conditions, whereas in the latter he has scarcely any control over, or 
insight into, them. It may be that while high scores on group tests are 
relatively stable and significant, low scores can arise from a variety 
of sources other than poor intelligence. However, a lot more careful 
research is needed in order to determine definitely the part played by 
such factors. 

7. Dependence on Practice or Coaching.—Most intelligence tests 
are published, hence anyone can procure them, coach testees on 
them, and ensure considerably increased scores. It is a most foolish 
thing to do, since a child who is made to appear unduly intelligent 
may also legitimately be expected to be capable of unduly difficult 
school work. But whenever examinations for secondary school 
entrance, or for other competitive purposes, include intelligence 
tests, teachers and parents who wish the children to do well will 
naturally be tempted. In individual testing it is fairly easy to recog- 
nize when children have been coached or practised, since their glib 
answers are very different from the answers of children who are 
puzzling out the tests for the first time. It is less easy to know what 
allowance to make. In group testing, on the other hand, one may 
suspect that scores are too high all round, though one can never be 
certain unless (as has actually occurred) a whole class answers 19 
test items out of 20 correctly, the twentieth being one where the 
teacher was mistaken. 

In order to prevent direct coaching, most Educational Authorities 
obtain tests which are freshly constructed and privately standardized 
every year by Moray House or other organizations. But copies of 
older, more or less parallel, versions are readily available, and there 
are large sales of booklets which contain test items very similar to 
those in the unpublished tests. Particularly at the 11+ stage, and in 
areas where the provision of grammar school places falls far below 
the demand, intensive coaching in schools, at home, or by outside 
tutors, is known to be widespread. Not only does this tend to distort 
the normal school curriculum (cf. p. 199), but also it produces severe 
strain and anxiety among many of the pupils. 

The topic has received considerable publicity, and numerous 
investigations have been carried out. These show that coaching and 
practice on intelligence test materials do make an appreciable 
difference at the selection borderline, though they are nothing like as 
effective as parents and many teachers imagine. Here we should 
distinguish practice in taking complete intelligence tests, without 
further instruction or knowledge of the right answers, from coaching, 
where items are explained and hints given. A single previous practice 
test raises the average I.Q. by about 5 points, and further practices 
yield diminishing returns up to a total of about 10 points. After 
some five tests there is no further improvement, and I.Q.s fluctuate 
irregularly or even begin to decline. When combined with coaching, 
the effects are somewhat larger, the average maximum rise being 12 
to 15 points. But these figures refer to children or adults who have 
never seen a group intelligence test before. When, as is usually the 
case nowadays, they are fairly familiar with tests, the effects may be 
only half as great. Moreover it has been shown that the effects are 
highly specific. Thus the kind of coaching which is given from pub- 
lished books of miscellaneous items, and which does not include the 
taking of a complete parallel test under examination conditions, is 
remarkably ineffective, producing barely a 5 points rise. Again, the 
maximum improvement occurs remarkably quickly; if practice or 
coaching amount to more than a few hours they seem to defeat their 
own ends. 

We have admitted earlier that good schooling and home back- 
ground do genuinely stimulate the development of intelligence. But 
it is clear, from the facts just stated, that intelligence cannot be 
‘taught’ in at all the same sense as school subjects are taught. Practice 
and coaching merely provide a limited facility in dealing with the 
rather roundabout and artificial types of items that occur in most 
group tests, in understanding and following the instructions quickly, 
and in the concentration of effort. Thus they have distinctly less 
effect on the more natural types of items used in individual tests, 
where the testee supplies his own answer in his own words, than they 
do on most multiple-choice or recognition-type group test items. 

In areas where coaching is at all common, the only fair solution 
seems to be to allow a limited amount of practice and instruction on 
parallel tests in the schools. This will give uncoached children a 
reasonable familiarity with the sort of tests they are to take, and will 
generally bring them to the level of the more intensively coached. 
But it would be better still if tests were used primarily for guidance 
rather than for selection, since then the incentive to coach would be 
removed. At the same time, the mental tester can never afford to 
ignore practice or coaching; the amount and kind of previous 
experience of tests that testees have had does make a difference. For 
example, in any experimental investigation involving the comparison 
of two or more tests, the scores on those taken later will be to some 
extent affected. The educational psychologist should realize that 
published test norms may be inapplicable to children who are 
unduly ‘test-sophisticated,’ and should usually avoid giving tests in 
any school at intervals of less than, say, two years. 

8. Validity.—The various versions of the Binet test, as already 
mentioned, measure a poorly analysed hotch-potch of abilities, and 
they have been severely criticized by Spearman, Cattell and others on 
account of the weaknesses of their statistical construction and 
theoretical foundations. Burt (1939b) likewise points out the 
heterogeneity of item content. Perhaps still more serious are the variations 
in content at different age levels; for example, the verbal bias in 
Terman-Merrill becomes much more pronounced at the top end. 
Thus the scale cannot be said to be measuring the same intelligence, 
or even the same pattern of abilities, throughout. A further 
consequence of this heterogeneity is that it tempts testers to make dubious 
subjective analyses of types of items, as when they note that a child 
does especially well on the more practical items, or badly on the 
memory items, and jump to the conclusion that he is a ‘practical’ 
child in daily life, or has poor ‘memory’ for school work. The 
experienced clinical psychologist can obtain many suggestive hints 
as to qualitative features of the child's intellect and temperament 
from his responses to particular items, or groups of items, but they 
are only hints which require to be followed up in far greater detail 
before they can be regarded as of any diagnostic value. Qua test, it 
only yields the one general score, which can be regarded as rather a 
vague mixture of g, v:ed, and other factors. 

In contrast, the well-constructed group test is highly homo- 
geneous, each of its sub-tests or types of items involving much the 
same factor pattern. Spearman would have claimed that it was an 
almost pure measure of g, though we realize now that the content 
and form of the items, together with difficulty level and timing 
conditions, do introduce other factors. It seems paradoxical, then, 
that most psychologists prefer the relatively unscientific Terman- 
Merrill test when they are called on to make any decision involving 
the intelligence of an individual child. They regard it as their most 
valid mental measuring instrument, and as giving the best available 
index of a child's general intelligence. That less trust can be put in 
group test scores is shown by the fact that different group tests 
often give somewhat discrepant results. If they are fairly similar in 
make-up the correlation between them (over a few days or weeks) is 
likely to be between 0.8 and 0.9 in a representative group. But with 
greater diversity, as between verbal and non-verbal tests, the correlation 
may sink to 0.7 or below, and discrepancies between a child's 
two I.Q.s may reach 30 points or more, though the average or median 
difference should still not exceed about 7 points. At the same time, 
group verbal tests actually give at least as good predictions of future 
educational standing as do the Terman-Merrill or Wechsler, among 
normal children aged 10 upwards and adults. The superior validity 
of individual tests is thus partly illusory, though it probably does 
hold good for younger children, particularly in regard to intelligence 
outside the school. 
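
The link between the correlations and the size of the discrepancies quoted above can be illustrated with a short calculation. The sketch below rests on assumptions that are not in the text: both tests are taken to give normally distributed I.Q.s with the same Standard Deviation, set here at 15 points, and the function name is hypothetical. Under those assumptions the difference between a child's two I.Q.s has a Standard Deviation of SD × √(2(1 − r)).

```python
import math

SD = 15.0  # assumed I.Q. Standard Deviation; the text gives no figure here


def difference_stats(r, sd=SD):
    """Spread of the difference between two equally scaled test scores.

    Assumes bivariate normal scores with equal SDs and correlation r.
    Returns (SD of difference, median |difference|, mean |difference|).
    """
    sd_diff = sd * math.sqrt(2 * (1 - r))
    median_abs = 0.6745 * sd_diff                 # median of a half-normal
    mean_abs = sd_diff * math.sqrt(2 / math.pi)   # mean of a half-normal
    return sd_diff, median_abs, mean_abs


for r in (0.9, 0.8, 0.7):
    sd_diff, med, mean = difference_stats(r)
    print(f"r={r}: SD of difference {sd_diff:.1f}, "
          f"median |d| {med:.1f}, mean |d| {mean:.1f}")
```

On these assumptions a correlation of 0.7 gives a median difference of roughly 8 points, of the same order as the figure of about 7 mentioned above, while a gap of 30 points is about 2.6 Standard Deviations of the difference: rare, but to be expected occasionally in any large group.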

The probable explanation of this situation is that the intelligence 
which we recognize in daily life, in school or out of it, is not pure g, 
but a mixture of abilities. Binet and his followers chose items largely 
from children's everyday behaviour and thinking, regardless of their 
factor content, and the very impurities that they included happen to 
overlap very well with the mixture we want. In the group test, how- 
ever, the items are mostly of a more restricted and artificial type, 
though they happen to coincide fairly closely with the kind of in- 
telligence required for passing examinations. The child is forced to 
express his abilities through a less natural medium, and this intro- 
duces such components as facility-in-doing-multiple-choice-items- 
at-speed, which are irrelevant to anything we want to predict. 

A further consideration, especially among younger children and 
among the more unstable or abnormal testees of any age, is that the 
group test yields only a numerical score, with none of the additional 
indications which are so useful in interpreting individual test re- 
sults. The Binet tester has much more effective control over the 
testee’s state of mind. He ensures favourable co-operation, and can 
judge, albeit subjectively, whether his responses are typical of the 
best that he can do. Despite the small effects of mood, health, etc., 
on group test scores, there are likely to be many uncontrolled factors 
which upset these scores in individual instances, and which partly 
account for the imperfect correlations between different tests. It is 
artificial, again, to separate off intelligence as a purely cognitive 
variable from the rest of the personality. Normally it always operates 
in a certain emotional and social context. The group tester tries to 
ignore this; but the individual tester, by observing the child’s manner 
of approach to difficulties (particularly those offered by performance 
tests), from the kind of test response and from general conversation, 
can build up a picture of the child’s whole personality, 
and see how that personality applies his available intelligence. Thus 
he obtains a much more complete and practically useful view than 
can be got from a more scientific measure of ability factors, though 
admittedly this view is liable to subjective distortions. 

The present writer would hold that the most reliable testing of 
individuals must be done individually. But instead of one scale of 
somewhat indeterminate and variable content, he would prefer to 
see developed a series of shorter and more distinctive scales or sub- 
tests. Each sub-test should be shown, by factorial studies, to measure 
as homogeneous an ability as possible, if necessary over a limited age 
range only. Such abilities need not be pure factors. Indeed over- 
lapping is unavoidable, as they would all involve g, and possibly 
other factors, in common. But they should cover as wide a range of 
mental functions as possible, and so enable us to carry out more 
penetrating and more scientific differential diagnoses. Instead of 
merely summing the scores to provide a measure of general intelli- 
gence, we should combine the results of several scales with appro- 
priate weights to measure, say, intelligence in everyday life situations; 
a different combination would give aptitude for academic education, 
another for technical education, yet another for mental deterioration 
in different types of psychotic and brain-injured patients, and so on. 
Such weightings would be based on actual follow-up research, not 
on subjective judgments of what the scales measured. The Wechsler 
tests already go some way in this direction, but their sub-tests tend 
to overlap too highly, and they are mostly too unreliable to yield 
useful differential diagnoses. They give a good verbal I.Q. and a 
total (verbal + performance) I.Q.; but attempts to use patterns of 
scores, or weighted scores, on the different sub-tests for revealing 
various types of mental abnormality, seem to have broken down. A 
number of batteries of so-called differential aptitude tests have been 
published in America, descriptions of which are given by Anastasi 
(1954). Though some of these give promise of being more useful in 
vocational guidance than a single general test, they too suffer from 
insufficient reliability and too great overlapping. Thus the develop- 
ment of a really satisfactory set of tests, beyond the stage that 
Wechsler has reached, is by no means easy. 


LISTS OF TESTS AND METHODS OF MEASUREMENT 
IN EDUCATIONAL PSYCHOLOGY 


The following list includes most of the mental tests available in 
Britain at the time of writing, together with some particulars of age 
ranges, time requirements, publishers, prices, etc. A number of older 
tests, little used nowadays or out of print, are excluded; so also are 
most tests whose sales are restricted to Directors of Education or 
other specially qualified persons. A few tests published in Australia 
or in U.S.A. are included which seem likely to be useful to British 
psychologists and teachers. These and other tests from such countries 
can generally be ordered through the National Foundation for 
Educational Research in England and Wales (referred to in the list 
as NFER). 
Publishers.—Publishers' names are abbreviated as follows: 

AE    Australian Council for Educational Research 
B     Baird Scientific Instrument Co., Edinburgh 
CL    Crosby Lockwood 
G     Gibson, Glasgow 
H     Harrap 
ICP   Institute of Child Psychology 
IPAT  Institute for Personality and Ability Testing¹ 
L     H. K. Lewis, London 
LG    Longmans, Green 
M     Methuen 
NFER  National Foundation for Educational Research 
NI    National Institute of Industrial Psychology 
N     Nelson 
OB    Oliver and Boyd, Edinburgh 
PC    Psychological Corporation, New York 
SRA   Science Research Associates, Chicago 
U     University of London Press 


Prices.—The prices quoted are those current in 1955 for 100 copies 
(including purchase tax), unless otherwise indicated. Naturally these 
are liable to fluctuation. Group tests are often available in dozens or 
sets of 25, at slightly increased relative cost. Single specimen copies 
with manuals are usually issued at about 2/- per test. An asterisk (*) 
in the prices column indicates that the answers can be written on 
blank paper or special answer sheets, or given orally, so that the 
printed blanks can be used over again. Where no price is quoted, 
the necessary material is given in the book referred to. 

Times.—When a definite figure for time limit is given, this usually 
means the actual working time, apart from instructions, etc. An 
approximate figure, e.g. 20-30, or c. 25, shows the number of minutes 
including instructions, or indicates that the time varies with different 
testees. 

1 Agent, Mrs. ЇЧ. M. Barnes, The Grammar School, Henley-on-Thames, 

Age Ranges.—These are given in two columns headed C.A., and 
M.A. The former indicates that percentile, standard score or 
other norms are available for all children in this range, whereas the 
latter implies that mean (or median) test scores only are listed for 
certain ages. Thus a test with C.A. 10-12 covers a wider range than 
one with M.A. 8-14½, since the latter will not discriminate below 
I.Q. 80 at 10.0 years, nor above I.Q. 120 at 12.0 years. 
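
The arithmetic behind this comparison can be set out as a small sketch. It assumes the classical ratio I.Q. = 100 × M.A. / C.A.; the function name is hypothetical, and the upper limit of the M.A. norms is taken as 14.4 years, a figure assumed here so that the calculation reproduces the I.Q. 120 quoted for 12-year-olds.

```python
def iq_limits(ma_low, ma_high, chronological_age):
    """Lowest and highest ratio I.Q. that a table of Mental Age norms
    can express at a given chronological age (all ages in years)."""
    return (100 * ma_low / chronological_age,
            100 * ma_high / chronological_age)


# Norms assumed to run from M.A. 8 to about 14.4 years.
for ca in (10.0, 11.0, 12.0):
    low, high = iq_limits(8.0, 14.4, ca)
    print(f"C.A. {ca}: I.Q. {low:.0f} to {high:.0f}")
```

The floor is highest for the youngest children (I.Q. 80 at 10.0 years) and the ceiling lowest for the oldest (I.Q. 120 at 12.0 years), so across the whole 10 to 12 range such a test discriminates only between about I.Q. 80 and 120.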

References to books or articles where descriptions of, or norms 
for, the test may be found are generally shown, by date, after the 
author's name. Some additional references and notes are given in 
small type. Readers who require a detailed critical evaluation of any 
American published test should consult the latest edition of Buros's 
(1953) Mental Measurements Yearbook. This, however, covers only 
a proportion of British and Australian tests. 
CHAPTER X 
HINTS TO TESTERS 


1. Choice of a Test.—In choosing a test for some particular purpose, 
the tester's first consideration should, of course, be its validity for 
that purpose. As we have seen in Chap. IX, he may be able to judge 
its validity to some extent by inspection of its content, but he would 
be better advised to seek out experimental proof of what it is that 
the test measures. Investigations by its author, or by others who have 
used the test, should be consulted. The mere fact that it is called a 
test of such-and-such an ability does not constitute evidence. The 
second consideration (which greatly affects the first) is the test's 
reliability. Generally speaking, a test must be fairly long, and its 
scoring must be objective, if it is to be satisfactory in this respect. 
Most authors of good tests publish evidence of their reliability in the 
accompanying manuals. Little trust should be placed in tests whose 
scores are not known to yield a reliability coefficient of 0.90 or over 
in a representative group—say, a complete year-group of children. 
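
Why length matters for reliability can be illustrated with the Spearman-Brown formula, a standard result that is brought in here as an aid and is not quoted in this passage. The sketch assumes that the added items are comparable to the existing ones; the function names and the starting reliability of 0.75 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor
    using comparable items (Spearman-Brown formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_needed(reliability, target=0.90):
    """Factor by which a test must be lengthened to reach the target."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# A hypothetical short test with a reliability of 0.75.
print(round(spearman_brown(0.75, 2), 3))  # effect of doubling its length
print(round(length_needed(0.75), 2))      # lengthening needed to reach 0.90
```

On this reckoning a short test with a reliability of 0.75 would need to be made about three times as long before it met the 0.90 standard suggested above.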

2. Age Range.—Few, if any, tests presume to cover all ranges of 
ability, and most educational and intelligence tests are only intended 
[...] Thus the tester should be careful to [...] tests those which will, 
when applied [...], be likely to yield a normal distribution [...] unduly 
curtail this distribution at either [...] above or below its average C.A., 
he must correspondingly adjust his choice of test. 

The ‘differentiating power’ of the test norms should [...]. They should 
show a [...] fairly large and even increase in score from one 
age level to the next. Suppose that a test containing, say, 200 items 
has the following norms from 6 to 13 years. The increases are large 
up to 9 or 10, but then tail off, till there is a difference of only 5 items 
between the 12- and 13-year scores. Such a test will clearly not dis- 
criminate or differentiate well at the upper end, and its effective range 
of norms is from 6 to 11 rather than from 6 to 13. 


Years .     6     7     8     9    10    11    12    13
Score .    40    71   101   128   151   168   179   184


Similar considerations apply to the choice of tests whose norms 
are issued in percentile, or other, forms. 
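
The check on differentiating power described above can be sketched in a few lines of Python. The norms are those of the example table; the cut-off of 15 items per year is purely an assumption made for illustration (the text gives no fixed figure), chosen so that the sketch reproduces the effective range of 6 to 11 stated above.

```python
# Norms from the example table: average score at each age level.
ages = [6, 7, 8, 9, 10, 11, 12, 13]
scores = [40, 71, 101, 128, 151, 168, 179, 184]

# Hypothetical cut-off (an assumption, not from the text): the smallest
# yearly gain in score that we are prepared to call adequate.
MIN_GAIN = 15


def yearly_gains(scores):
    """Return the increase in score from each age level to the next."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def effective_upper_age(ages, scores, min_gain=MIN_GAIN):
    """Return the last age reached before the yearly gain falls below min_gain."""
    upper = ages[0]
    for age, gain in zip(ages[1:], yearly_gains(scores)):
        if gain < min_gain:
            break
        upper = age
    return upper


gains = yearly_gains(scores)
for low, high, gain in zip(ages, ages[1:], gains):
    print(f"{low} to {high} years: gain of {gain} items")
print("Effective range of norms:", ages[0], "to", effective_upper_age(ages, scores))
```

A different cut-off would, of course, give a different effective range; the point is only that the yearly gains, and not the mere existence of norms for an age, show where the test stops discriminating.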

3. Convenience of Arrangement and Scoring.—If a test is to be
used extensively, attention should be paid to its general arrangement 
from the point of view of ease in applying and scoring. With some 
performance tests and tests of special aptitudes, much time is wasted 
in setting out the material, or the instructions are very elaborate, or 
the scoring unnecessarily detailed. More modern American tests 
usually try to reduce these inconveniences. 

Group tests of attainment or intelligence may also vary greatly in 
‘mechanics’ or ‘lay-out.’ Some of the older ones have insufficient 
instructions and sample items or practice sheets, so that testees often 
fail to understand what they have to do, and a lot of additional oral 
explanation is needed. Sometimes the testees have to answer in their 
own words, and the scoring manual never allows for all the possible 
answers which turn up, so that much time is wasted in deciding 
whether or not an answer shall be counted. The positions of the 
answers are often scattered all over the page, instead of being ar- 
ranged in the right-hand margin. True, stencils are often provided 
which, when laid over the test page, show up the correct answers.
But, according to the present writer's experience, it is far more 
trouble to fit the stencils over each successive page than it is to com- 
pare the testee's booklet with a booklet on which all the correct 
answers are clearly marked. 

Many American group tests are now issued with an interleaved- 
carbon-paper device, so arranged that all the correct answers, and 
none of the incorrect ones, are automatically shown as crosses on 
the back of the page. The scorer then merely has to count the num- 
ber of crosses. Alternatively, testees may check their answers with a 
special carbon pencil; their blanks are then passed through an
electrical machine which counts instantaneously the number of
carbon marks in correct positions, through the amount of current
which passes. In this country, however, testing is seldom carried out 
on so large a scale as to warrant such ingenuity, and its consequent 
expense. 

4. Procedure.—Many of the following hints may seem puerile to 
some readers, yet they are all based on the writer's experience of 
mistakes actually committed by amateur testers. Such mistakes may 
often lead to quite serious consequences. 

The tester should first study the instructions and the test material 
carefully with a view to seeing precisely how the test works, and 
should then follow these instructions to the letter. Often it is worth 
the tester's while to take the test himself. In Binet testing, the results 
should not be trusted until such time as the tester practically knows 
the instructions, and the essentials of scoring, by heart, and until he 
has given the test to, say, twenty cases. 

5. Modification of Instructions or Materials.—The inexperienced 
tester is very apt to find flaws in the test or instructions which, he 
thinks, could be eliminated by some simple modifications. Very 
likely the test could be improved, but then it would no longer be the 
same test, and the published norms would no longer hold good. For 
the same reason a printed group test must not be duplicated or
mimeographed in order to save expense. Quite apart from the
probable infringement of copyright, the difficulty of the test is
likely to be altered thereby.


6. Timing.—Accurate timing is essential. In some group tests a
mistake of 5 seconds may quite possibly lead to an error of 1% or
2% in average I.Q. Testers who do not own stop-watches are advised
to use omnibus tests when feasible, since these require only a single
timing. If times are kept by a watch with a seconds hand, the exact
minute and second at which a test is to stop should be written down
as soon as the test has started. Mistakes very easily occur if the
tester trusts to his memory.


7. Underlining and Erasing.—Before giving a group test to a class,
all rulers and indiarubbers should be firmly removed, otherwise the
pupils will waste several minutes in ruling their underlinings, or
rubbing out responses when they change their minds. If the test
instructions do not already say so, the tester should explain to the
pupils that, in order to change a response, they should scribble out
the first choice rapidly, and then mark or underline the new choice
clearly.

8. Distractions and Copying.—In all testing, possible sources of
distraction should be avoided. For instance, a group test should not 
be given when a singing lesson is due to occur in the next classroom. 
The testees should be spaced as far apart as possible, since it is very 
easy, in multiple-choice tests, for one to overlook and copy the 
answers of another. Often the younger children have no intention of 
cheating, but are yet extremely anxious to write down the same as 
their neighbours, simply because the whole situation is unfamiliar to 
them, and they are afraid of doing the ‘wrong thing.’

9. Practice Sheets and Samples.—In most group tests for children 
there are preliminary practice sheets or sample items, where help 
may be given. The tester should make sure that even the dullest 
testees understand what to do, and that they can manage at least one 
or two of the sample items by themselves. 

10. Coaching.—No practice or coaching of any kind, beyond 
what is specified, should be given. Testees who gain an advantage by 
such means will no longer be comparable with those on whom the 
test was standardized, and the norms will be useless (cf. pp. 165-7). 

11. The ‘Subjective’ Situation.—At the same time as adhering to 
the standard objective conditions of testing, the tester should try to 
obtain, even in a group test, a favourable subjective atmosphere. If 
the tester himself is apprehensive, or bored, his mood is likely to 
communicate itself to a group of children. The testees should not be 
excited and anxious, nor, of course, lackadaisical, in their attitude 
to the test. Teachers should not peer over children's shoulders while 
they are working to see how they are getting on; it inevitably upsets
them. Still less should they criticise, or help, children with wrong 
answers. 

12. Testing and Teaching.—Many teachers make bad testers be- 
cause the teaching attitude is so ingrained in them that they cannot 
refrain from drawing morals, or pointing out mistakes. A test is not 
a kind of lesson, and children should not be expected to learn any- 
thing whatsoever from it. Rather it is a matter of stocktaking. The 
essential object is to obtain a record of the testees’ performances 
under certain standard conditions, so that these performances may 
be compared with those of other similar testees who have been 
tested under the same conditions. 

13. Scoring.—The scoring instructions must be followed as
meticulously as the instructions for applying the test. The tester may
sometimes disagree with the response provided in the scoring key,
and think that some alternative is just as good or better. This may be
so, but once personal opinion enters the scoring ceases to be ob- 
jective, and the results again cannot be compared with the test 
norms. The scoring of most group tests is quite mechanical, but 
errors very readily creep in, both in checking the right responses and 
in adding up correct checks. It is extremely desirable for such tests 
to be scored twice, preferably by different persons. 

14. The Subjective Situation in Individual Testing.—The conditions 
of individual testing require still more tact and sympathy than those 
of group testing. By means of preliminary conversation the child 
should be put at ease, and interspersed irrelevancies should keep him 
interested and responsive throughout. If he shows signs of fatigue, 
distraction, or resistance the testing should be broken off. The good 
tester develops a kind of ‘bedside manner,’ and avoids formalities. 
He does not read the oral tests from the handbook, but, knowing 
them by heart, says them in a conversational tone. In order to keep 
the child interested he must himself sound interesting. Encourage- 
ment should follow every response, but some caution is needed in 
praising incorrect responses, since many children may realize when 
they are wrong and become careless or discouraged if there is no 
criticism nor stimulus to do better. The presence of a third person,
particularly the mother or head teacher, constitutes a distraction
[...] jump about from one part of the [...] the tester to choose the
test which [...] appropriate to the child's mood at the [...] grouping
together tests with identical [...] tests); and it prevented all the
[...] at the end of the session. This pro[...] standard instructions
and scoring.


15. Record Keeping.—The following principle is commonly 
adopted by experienced testers: never spend any of the child's time 
in making records of his responses. Notes should certainly be made,
not only of test responses, but also of manner, speech, etc. All the
recording, however, should be done while the child is engaged either
in doing a test, in irrelevant conversation, or in listening to the 
instructions for another test; and it should be so unobtrusive that it 
never constitutes a distraction. Poor testers not only waste time in 
recording responses and looking up the scoring, but also allow 
awkward gaps to occur which entirely spoil the favourable atmosphere
of alertness and interest. The reason for this is usually mere
lack of familiarity with instructions and scoring. 

16. Impartiality.—While the good tester enters very fully into the
thoughts and feelings of the child he is testing, yet he also needs to 
exert considerable self-control. There are various ways in which he 
may (quite unwittingly) give illegitimate hints, for instance, by over- 
emphasizing the crucial words in an oral test, or by making empathic 
movements in the direction of the correct blocks, holes, etc., of a 
performance test. Further, he cannot help but feel more sympathetic 
with some children than with others. He should, therefore, frequently
ask himself: ‘Am I in any way being prejudiced by my
liking for this child; am I giving hints or crediting him with doubtful
passes because I believe that he is bright, which I would not do if I
believed that he was dull?’

17. Studying the Whole Child.—The tester should never regard 
his job merely as measurement of the I.Q. or other test scores 
(although many doctors, teachers, and others may seem to regard 
that as his sole function). Rather he should try to understand, and 
to build up a picture of, each child's whole personality. By doing so 
his test results are more likely to be accurate, since he will realize 
when they are being distorted by emotional or other factors; and he 
will certainly be better able to interpret their significance when the 
guidance or treatment of the child is under consideration. Although, 
as we have seen, he is not justified in analysing a heterogeneous test 
like the Binet scale into tests of hypothetical discrete abilities 
(memory, verbal, practical, etc.), yet he can pick up innumerable 
hints regarding the child's intellectual and emotional characteristics, 
which can later be explored more thoroughly. Suggestions of specific 
educational difficulties may readily be followed up. Buehler (1938) 
shows that responses to the Ball and Field (or Purse and Field) test 
may give valuable diagnostic indications. Other Binet items and, still 
more, performance tests like the Porteus Mazes may be equally 
revealing (cf. Vernon, 1937; Porteus, 1952). These deductions should, 
of course, be checked later by comparing them with other sources of 
information, such as the psychiatrist's case-record, when available.
In this way a tester can always be extending his methods and im- 
proving their validity. 

18. Interpretation of Test Results: Unreliability.—No mental test 
score should ever be accepted at its face value, nor trusted in the same 
way as physical measurements are trusted. Even the best tests, it 
should be remembered, only measure to within a certain Probable
Error. In other words, a test score should be regarded as the centre
of a range of possible scores. It may be correct to within, say, 5%,
but also it may be in error by as much as 20%, unless it has been
confirmed by the results of other tests which narrow the range of 
uncertainty. 
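
The size of this range can be illustrated with the usual textbook relation between reliability and error of measurement: the standard error of measurement is SD × √(1 − r), and the Probable Error is 0.6745 times that figure, on the assumption that errors are normally distributed. The sketch below is a hedged illustration only; the Standard Deviation of 15, the reliability of 0.90, and the obtained score of 100 are assumed values, not figures taken from any particular test.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of an obtained score: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def probable_error(sd, reliability):
    """Probable Error: 0.6745 standard errors (normal errors assumed)."""
    return 0.6745 * standard_error_of_measurement(sd, reliability)


# Assumed illustrative values (not from the text).
sd = 15.0           # Standard Deviation of the quotients
reliability = 0.90  # reliability coefficient of the test
score = 100         # one testee's obtained quotient

pe = probable_error(sd, reliability)
print(f"Probable Error: {pe:.1f} points")
print(f"Range of one P.E. about the score: {score - pe:.1f} to {score + pe:.1f}")
```

On these assumptions only about half of all obtained scores would lie within one Probable Error of the testee's true score; the remainder would be further out, which is why a single score is better treated as the centre of a range than as a fixed quantity.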

19. Units of Measurement.—The imperfections of mental units
should also be borne in mind, for example, the variations in size of
M.A. and E.A. units (cf. pp. 70-1). The I.Q. is still more difficult for
teachers and laymen to interpret correctly. When derived from
M.A./C.A. it is primarily a measure of rate of mental growth. But
when it represents a standard score (as in Moray House, Wechsler,
and other tests), it is—like a percentile—a measure of brightness
relative to other children of the same chronological age. In neither
case is it an index of the amount of intelligence the child possesses
at that moment. Similarly the E.Q. is not a measure of present
attainment. The M.A. or E.A., on the other hand, do provide such
indices. The point may be brought out by the following example of
the test scores of two children, A and B.


A . .    C.A. 9:0    M.A. 11:0    I.Q. 122
B . .    C.A. 8:0    M.A. 10:6    I.Q. 131


Here it is A, not B, who is the more intelligent, since he can do more
difficult intellectual tasks, and can be expected to be more advanced
in school work. It is true, however, that B is the brighter relative to
8-year-olds than A is relative to 9-year-olds; and as B is growing
more rapidly in mental powers, it is likely that he will catch up with
A in another four years, and (if the test is reliable and valid) that he
will, as an adult, be the more intelligent of the two.

In actual practical work of educational guidance or emotional
treatment, the tester need hardly bother to calculate the I.Q. or
E.Q., since what matters is the present level of the child's intelligence
and educational capacities. The only proper use of the I.Q. is
for predicting a testee’s future career, i.e. for assessing approximately
his ultimate educational or vocational level. 
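
The figures for the two children A and B above can be checked with a short Python sketch of the ratio I.Q., 100 × M.A./C.A. It assumes that an age written as 10:6 means 10 years 6 months; the function names are invented for the illustration.

```python
def to_months(years, months=0):
    """Convert an age given as years and months into months."""
    return 12 * years + months


def ratio_iq(ma_months, ca_months):
    """Ratio I.Q.: 100 * M.A. / C.A., rounded to the nearest whole number."""
    return round(100 * ma_months / ca_months)


# C.A. and M.A. of the two children in the example (years, months).
children = {
    "A": {"ca": to_months(9, 0), "ma": to_months(11, 0)},
    "B": {"ca": to_months(8, 0), "ma": to_months(10, 6)},
}

for name, child in children.items():
    iq = ratio_iq(child["ma"], child["ca"])
    print(f"{name}: M.A. {child['ma']} months, I.Q. {iq}")

# The child with the higher M.A. has the higher present level,
# whichever child has the higher I.Q.
higher_level = max(children, key=lambda name: children[name]["ma"])
print("Higher present level:", higher_level)
```

The sketch gives 122 for A and 131 for B, and shows at the same time that the comparison of present level rests on the M.A. and not on the quotient.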

20. Dependence of I.Q. on the Test Employed.—The naive tester or
layman usually assumes that a child would obtain the same I.Q.
whatever the test employed. This is so far from true that the test
employed should invariably be quoted along with an M.A., E.A.,
I.Q., or E.Q. The reasons for these discrepancies have already been
given, and will merely be summarized here. 

(a) Chance errors of measurement, due to the fact that no test is
perfectly reliable (cf. pp. 125-9). 

(b) Many tests are not well standardized. For instance, American 
group tests may give results which are too lenient (pp. 66-8). 

(c) Different intelligence tests have different dispersions of M.A.s 
and I.Q.s, so that one test makes a bright child out to be much
brighter, a dull child to be much duller, than does another test
(pp. 71-3).

(d) The testees may have had more practice or coaching relevant 
to one type of test than to the other (pp. 165-7). 

(e) All tests are liable to distortion by unsuspected group factors,
or by emotional or other influences, about which we possess little 
exact knowledge, and which we are not able to control (pp. 163-5). 
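
Reason (c) lends itself to a small numerical illustration. The sketch below treats quotients as standard scores with a mean of 100, and compares two hypothetical tests, here called X and Y, whose Standard Deviations are assumed to be 15 and 20. Neither the test names nor these dispersions are taken from the text; they serve only to show how the same relative standing yields different quotients.

```python
def quotient(z, sd, mean=100.0):
    """Quotient of a child standing z Standard Deviations from the mean."""
    return mean + z * sd


def convert(iq, sd_from, sd_to, mean=100.0):
    """Re-express a quotient from one test on the dispersion of another."""
    return mean + (iq - mean) * sd_to / sd_from


# Hypothetical dispersions (assumptions made for illustration only).
SD_TEST_X = 15.0
SD_TEST_Y = 20.0

for z in (-2, 0, 2):
    x = quotient(z, SD_TEST_X)
    y = quotient(z, SD_TEST_Y)
    print(f"standing {z:+d} S.D.: Test X gives {x:.0f}, Test Y gives {y:.0f}")

print("An I.Q. of 130 on X corresponds to",
      round(convert(130, SD_TEST_X, SD_TEST_Y)), "on Y")
```

On these assumptions a child two Standard Deviations above the mean scores 130 on one test and 140 on the other, and a child two below scores 70 and 60, although his standing relative to other children is identical in each case.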

In view of this uncertainty over the norms and the dispersion of 
group intelligence tests, the inexperienced tester would be well ad- 
vised never to calculate individual I.Q.s from them, but to treat their 
results in purely relative fashion, i.e. in the way which was recommended
for educational tests (cf. pp. 68-9).1 The class teacher can
obtain from a group test practically all the information which it is 
capable of yielding if he or she simply finds which child in the class 
has the highest score (corresponding to the highest M.A.), which the 
next highest, and so on. When test results are thus confined to rank 
orders, they are far less likely to be misinterpreted than is common 
at present. 

1 An exception may be made in the case of 11+ selection tests, which almost
invariably adopt a Standard Deviation of 15. Such a use of quotients provides
the simplest way of making age allowances, without which far more of the older
children in a year-group of candidates than of the younger ones would gain
places. At the same time, egregious statistical errors, leading to gross unfairness,
can be and often are committed when Education Authorities try to run such
examinations without the advice of a statistically trained psychologist.

21. Conclusion.—Mental tests have been widely, and often unfairly,
criticized. But the progress of testing is probably hindered to
a greater extent by its friends, who are ignorant of its limitations,
than by its enemies. One of the main objects of this book is to 
analyse these limitations and to explain the underlying principles 
which must be understood if tests are to be applied and interpreted 
properly. No one ought to use tests who is not willing to familiarize 
himself with these matters, and with the literature which bears on the 
particular tests that he employs. In skilled hands, testing provides a 
far surer and more accurate tool for the assessment of abilities than 
do subjective impressions or the ordinary examination paper. But 
human nature is far too complex to be measured in the simple and 
direct ways in which physical quantities can be measured, and the 
over-enthusiastic but unskilled tester is only too likely to make
serious mistakes.


CHAPTER XI 
EXAMINATIONS 


Functions of Examinations.—The chief functions of ordinary written 
examinations may be summarized as follows: 

1. They are employed as tests of achievement, to prove that an 
individual pupil has, or has not, acquired a certain amount of know- 
ledge of a subject. 

2. They are given to a whole class, or school, for a similar purpose, 
or for assessing the efficiency of the teachers whose pupils are 
examined. Although ‘payment by results’ has disappeared, schools 
are still commonly judged by the public on the basis of their ex- 
amination successes. And both head teachers and inspectors try to 
ensure the maintenance of certain standards by means of periodic 
oral or written examinations. 

3. Many of the most important examinations are assumed to 
possess a prognostic function, i.e. to be capable of predicting future 
achievements. Thus they are used to bar the gates between the 
primary and the secondary school, between the secondary school 
and the university, between the university and many professions; 
and they are supposed to select the few who are best fitted for these 
higher stages. 

4. Many qualities besides achievement, or promise of achieve- 
ment, in particular scholastic subjects, are believed to manifest them- 
selves in examination success. The business employers who insist 
that their clerks shall have passed Matriculation or the General 
Certificate do not actually desire employees who are good at Latin, 
biology, and the like. When asked why they attach such importance 
to academic examinations, their answers show that they hope, rather, 
to acquire clerks of good general all-round ability in non-academic 
as well as in academic subjects. They say also that success at an 
examination requires a certain amount of perseverance and in- 
dustry, some docility to discipline, and calmness under pressure; 
and that these qualities of character are desirable also in their 
employees. The significance attached to a university degree is often 
almost as variegated. Employers are probably quite as much to
blame for the stranglehold of the contemporary examination system
as are the teaching profession and the education authorities. 

5. Another function of examinations whose existence is seldom
explicitly admitted is that they constitute an incentive for
pupils and students to work, particularly in the secondary school
and university; also for keeping teachers up to the mark. Indeed,
some would maintain that the prospect of having to do examinations 
must still be employed as one of the chief motives in education, even 
if their results are entirely worthless. Clearly, then, examinations are 
an essential feature of the whole educational system which relies 
largely upon extraneous motives, such as competition, fear of 
punishment, and blame. And they can hardly be discarded until the 
more progressive ideal of appealing to the pupils’ or students’
spontaneous interests becomes a practical reality. 

6. Many American writers (e.g. Kandel, 1936) believe that the 
only legitimate function of examinations should be that of guidance. 
By checking up the results of his instruction the teacher can see 
where he has failed to make some topic clear, and can improve his 
methods. By examining a pupil’s accomplishments he can both 
diagnose the pupil’s weak points which require special attention, and 
can direct his endeavours along the lines for which he is most suited. 
Secondary schools and universities should apply examinations to 
entering candidates, not so as to keep out all but a select few, but to 
discover the fields of work in which each candidate is likely to do
best, and should dissuade from entering only those who could
profit from no kind of advanced study.

Unfortunately, the traditional written examination fails very
badly in its attempts to fulfil these many different functions. We will
consider first the criticisms which have been levelled against it, and
later ask whether other measuring instruments would serve the
purpose better, or whether its adequacy can in any way be improved.


Criticism of the Effects of Examinations upon Education.—The
first and commonest complaint is that examinations dominate and
distort the whole curriculum. Although only about 5% of the
population reach the university, every grammar school pupil has to
study for the General (or the Scottish Leaving) Certificate as though
his ultimate objective was a university degree. The schools cannot
afford to educate in the broader sense, nor to give the majority of
pupils such knowledge as would be of most use to them in after
life, because they must work towards examinations concerned
almost exclusively with formal, academic, subjects. Again, narrow
specialization is apt to occur in grammar and public schools, for the 
sake of university scholarships. 

Even more widespread are the bad effects of the 11+ secondary 
school entrance examination. Subjects or activities which do not 
contribute directly to passing the tests are generally neglected in the 
last year or two of primary schooling. In the larger schools, children 
are streamed even at the infant stage into classes thought likely to 
pass or fail. Homework and coaching out of school hours are introduced
at too early an age—more often, be it noted, at the behest of
the parents than of the teachers.

Furthermore, these examinations stimulate an unhealthy, com- 
petitive spirit in children, and at all ages they encourage the cram- 
ming of set books and rote memorization. Doubtless they act as an 
incentive (Function No. 5) but not an incentive to learning for its own 
sake. Indeed, examinations are an inefficient type of incentive, since 
most pupils and students regard them as too remote to bother about 
until a few weeks or days beforehand. A few candidates, however, 
take them much too seriously, or are forced by their parents and 
teachers to over-work; with the result that many of the maladjusted
children who come to Psychological Clinics for advice and treat- 
ment show signs of this examination strain. 

An additional excuse is often advanced for this state of affairs, one 
which is based on the old fallacy of formal discipline. It is said that 
although the passing of examinations in academic subjects is of no 
direct value to the vast majority of the population, yet the pupils’ 
minds are trained thereby so that they become better able to learn 
other subjects, to reason, to write good English, and so forth. A
large body of experimental evidence not only proves that this spread 
of training in one field of study to other fields is usually extremely 
limited in extent, but also suggests that the very opposite may occur. 
The system may indeed train pupils and students in the best methods 
of passing examinations in other subjects, but it is more likely to 
stunt 'reasoning power' owing to its emphasis on mechanical 
memorization. Often it produces a dislike of everything associated 
with school, and so discourages the desire for further education. It 
does not help to develop writing of good English, since the English 
style in which most examination answers are written is known to be, 
on the average, markedly inferior to the style in which the same 
pupils can write compositions at leisure. 

INADEQUATE RELIABILITY OF EXAMINATIONS

We will not, however, dwell on these points since, for the purposes 
of this book, it is a more serious matter that most examinations are 
very defective methods of measuring the achievements or the future 
promise which they set out to measure. Their chief, though not their 
only, flaw is inadequate reliability. The following facts are well 
established:

(a) When examinees sit two different examinations dealing with
the same scholastic subject, both set and marked by the same
examiner, they are likely to obtain distinctly different marks on the
two occasions.

(b) When a single set of scripts is marked by two different
examiners, the resulting marks are liable to show wide discrepancies. 

The main causes of poor reliability are: 

(1) Actual changes in the examinees’ mental and physical states 
which affect their examination performances. 

(2) Inadequate sampling of the examinees’ knowledge and ability
by the particular questions set. 

(3) Inconsistencies in the standards of marking adopted by 
different examiners, or by the same examiner on different occasions. 

(4) Differences of opinion between different examiners regarding 
the relative merit of the examinees’ answers. 

1. Changes in the Examinees.—This was described above (p. 128) 
as ‘function fluctuation.’ Two closely parallel examinations sat on 
consecutive days will always give somewhat divergent results, because
some examinees will be feeling more fit, or in a better frame of
mind, at the first of the two, others at the second. Many external and 
irrelevant happenings may temporarily depress or improve the work 
of a few individuals. Over long intervals the examinees' abilities will 
naturally show considerable developments and alterations. On the 
whole, however, the uncertainties occasioned by such factors are 
probably smaller and less important than those due to the other 
three causes. 

2. Inadequate Sampling.—This factor is responsible for the com- 
mon complaint among examinees, namely, that the questions ‘did 
not suit them,’ that on another set of questions they might do better 
(or worse). A pupil’s capacities in the various special branches or
aspects of a subject are always somewhat uneven. And no examination
is capable of bringing out everything that the pupil can do in
that subject, or of measuring all his strong and weak points. Hence,
the half-dozen or so questions which he answers are never more than 
a sample of the questions which would be needed to cover the whole 
field. Common sense, as well as statistical principles, tell us that 
the more lengthy the examination, and the greater the number of 
questions, the more representative will the sample be, and the better 
will the ground be covered. Thus in certain university honours 
examinations where the students’ marks are based on twenty-four
hours of examination work, their marks would be unlikely to alter 
to any great extent if they sat another similar series of papers. But 
when an examination consists, say, only of one short paper in 
English and one in arithmetic, quite large alterations might easily 
occur in further parallel papers. The correlation between results on 
ordinary 1- or 2-hour papers in the same subject depends con- 
siderably on the heterogeneity, or range of ability, among the 
examinees (cf. p. 122); but a fairly typical figure is +0.70, and it is
often less than this. We can therefore predict, by applying the 
Spearman-Brown formula (p. 127), that examinees would require to 
take at least four such papers before the ground was sufficiently 
covered for the reliability coefficient to rise to the fairly respectable 
figure of +0.90.
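
The prediction can be verified with a brief Python sketch of the Spearman-Brown formula, by which the reliability of an examination n times as long is n·r / (1 + (n − 1)·r). The figures of 0.70 and 0.90 are those quoted above; the function names are invented for the illustration.

```python
import math


def spearman_brown(r, n):
    """Reliability of an examination lengthened n times, given reliability r."""
    return n * r / (1 + (n - 1) * r)


def papers_needed(r, target):
    """Smallest whole number of parallel papers needed to reach the target."""
    return math.ceil(target * (1 - r) / (r * (1 - target)))


r_single = 0.70  # typical correlation between two parallel papers
target = 0.90    # the 'fairly respectable' reliability aimed at

for n in range(1, 6):
    print(f"{n} paper(s): reliability {spearman_brown(r_single, n):.3f}")
print("Papers needed:", papers_needed(r_single, target))
```

Three papers give a reliability of only about 0.875, and it is the fourth which carries the figure just past 0.90.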

Suppose that one hundred pupils take such an examination, and 
that the passing mark is fixed so that half of them pass. Now if they 
take a second similar paper which correlates +0.69 with the first,
we can predict that only 37 of the pupils would pass on both, and 
37 fail on both, and that as many as 26 would pass the first and fail 
the second, or vice versa. This hypothetical but only too typical 
illustration demonstrates the unfairness which inevitably results 
from the lack of thoroughness of many examinations, and in par- 
ticular the difficulties which arise when the examination is supposed 
to divide the candidates into two distinct categories at an arbitrary 
pass mark. 
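
The figures in this illustration can be reproduced on one common assumption, namely that the marks on the two papers follow a bivariate normal distribution, with the pass mark at the median of each. The proportion above the median on both papers is then 1/4 + arcsin(r)/(2π). The text does not state how its figures were obtained, so the sketch below is offered only as one calculation which agrees with them.

```python
import math


def both_above_median(r):
    """Proportion above the median on both papers (bivariate normal assumed)."""
    return 0.25 + math.asin(r) / (2 * math.pi)


r = 0.69       # correlation between the two papers
pupils = 100   # number of pupils sitting both papers

pass_both = both_above_median(r)
fail_both = pass_both               # by symmetry about the median
inconsistent = 1 - pass_both - fail_both

print("Pass both:       ", round(pupils * pass_both))
print("Fail both:       ", round(pupils * fail_both))
print("Pass one, fail one:", round(pupils * inconsistent))
```

With a correlation of zero the same function gives one quarter, and with a perfect correlation one half, as would be expected; at 0.69 rather more than a quarter of the pupils receive contradictory verdicts from the two papers.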

An additional reason for the variability of a pupil's level in 
answering different questions or sets of questions is that the majority 
of examination answers consist of English compositions or essays. 
(This does not usually apply to papers in mathematics or in foreign 
languages; but it does to English, history, geography, science, and 
other subjects.) And the English essay is known to be a particularly 
variable production. The compositions written by a pupil weekly 
throughout a school year fluctuate widely in their merits. As Ballard 
(1923) points out, teachers may refuse to accept this statement, and
claim that their pupils are fairly consistent from week to week. But 
the fact that a teacher marks any one pupil's work on a dead level 
probably indicates that, unwittingly, he has formed a stereotyped 
conception regarding his ability, and is marking the successive 
compositions not on their own merits but in the light of his know- 
ledge of the pupil's average achievement. The truth of this suggestion 
is borne out by the many stories of the two pupils who regularly 
received, one a good, the other a bad, mark, and who continued to 
do so even when they exchanged their essays. 

An outside examiner does not possess this knowledge of the 
pupil's average work, and so will give poor marks to those who 
happen to write answers much below their normal level, good marks 
to those who rise above it. Moreover, under the stress of examination 
conditions, such deviations are likely to be exaggerated. Probably 
much more representative products would be forthcoming if the 
examination could be answered at leisure, under conditions similar 
to those which apply when pupils do written work at home, or at 
school. The common practice in Civil Service, and certain other, 
examinations of using a single essay as an indication of the
candidates' all-round literacy, cultural background, and ‘quality of mind’
is particularly deplorable, as pointed out by Vernon and Millican
(1954) in an investigation of students' essays on several different
topics.

3. Divergent Scales of Marks.—This factor has already been dis- 
cussed fully in Chaps. II and IV, hence we will not pause to re-argue 
the illogicality and injustice of a system which permits a far higher 
proportion of passes, or a higher average mark, to be awarded in 
one university or General Certificate subject than in another. If a 
mark of 80%, or a first class or credit, signifies to one examiner the 
standard achieved by a quarter of the examinees, and to another the 
standard reached by only 1% of the examinees, then it is obvious 
that this mark means different things to different people. And it 
follows that an examinee’s mark will vary with the particular ex- 
aminer who happens to read his papers, unless all the marks given 
by all the examiners are uniformly standardized in accordance with 
the principles outlined above. 
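
As a rough illustration of what such standardization involves, the sketch below rescales each examiner's marks linearly to a common mean and standard deviation. The target values (mean 60, standard deviation 10), the function name, and the marks themselves are assumptions chosen for the example, not figures from the text; the method shown is the ordinary linear conversion and is not necessarily the exact procedure of the earlier chapters.

```python
from statistics import mean, pstdev


def standardize(marks, target_mean=60.0, target_sd=10.0):
    """Linearly rescale one examiner's marks to a common mean and SD."""
    m = mean(marks)
    s = pstdev(marks)
    if s == 0:
        raise ValueError("marks have no spread; cannot standardize")
    return [target_mean + target_sd * (x - m) / s for x in marks]


# Hypothetical marks given to the same five scripts by a lenient
# and by a severe examiner.
lenient = [82, 75, 90, 68, 85]
severe = [48, 40, 61, 35, 52]

lenient_std = standardize(lenient)
severe_std = standardize(severe)

print([round(x, 1) for x in lenient_std])
print([round(x, 1) for x in severe_std])
```

After conversion the two sets of marks share the same average and spread, while each examiner's order of merit is left unchanged.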

The standards of marking even of a single examiner are liable to 
alter with fatigue and mood. When marking large numbers of answers 
to the same question, he may begin very strictly and gradually 
become more lenient, or vice versa. The same script might receive 
a different mark if read after, instead of before, dinner. 

4. Differences in Opinion as to Relative Merit.—The most important
source of unreliability (which is usually found in combination
with No. 3) is the personal element which enters into every ex- 
aminer's evaluation of a script. We will first summarize some of the 
evidence regarding this subjectivity. The classical experiments were 
carried out in America as far back as 1912. Starch and Elliott (1913) 
sent copies of a single geometry paper to the chief geometry teachers 
of 116 high schools, with the request that they should mark it in 
accordance with their usual practice. The marks awarded varied all 
the way from 28% to 92%. Two of the teachers gave it more than 
90%, 18 gave it between 80% and 90%, 18 gave it between 60% and 
30%, and 2 under 30%. Similar results were obtained with papers in 
English and in history. Equally striking is the anecdote related by 
Wood (1921) of the papers which were marked in the usual way by 
six examiners. The first examiner, for his own guidance, wrote out a 
set of (what he regarded as) model answers to the questions; but he 
unfortunately left this among the candidates’ papers. He was subse- 
quently awarded marks varying from 40% to 90% by the other five 
examiners! Many experiments, conducted under far more strictly 
controlled conditions than these, have substantially confirmed their 
findings. Particularly noteworthy are the elaborate studies carried 
out in England and France in the 1930s, under the auspices of the 
International Examinations Enquiry Committee. 

Hartog and Rhodes (1935) arranged for typical groups of scripts 
to be marked by experienced examiners under normal examination 
conditions. Special place, School Certificate, and university honours 
examinations were represented, and several different subjects were 
studied. Generalizing from a number of their results, it appears that 
when half a dozen examiners independently mark a set of scripts, the 
lowest and highest marks given to any one script differ on the average 
by about 10 to 12%. Among a few of the scripts the discrepancies 
may be quite small, but among others they may be far larger, often 
greater than 20%. The authors show that part of the divergence may 
be ascribed to the different standards of leniency adopted by different 
examiners (No. 3, above), but that even when this factor is elimin- 
ated, large variations remain. In several of the experiments the 
examiners drew up an agreed scheme of instructions for marking, 
listing both the general characteristics and the detailed points which 
were to be looked for in the scripts. In the marking of English
compositions, where this was scarcely possible, the inconsistencies were
much greater. 

Now Hartog and Rhodes doubtless desired to shock the public 
conscience with regard to examinations, and so did their best to 
emphasize the disagreements. The present writer is more impressed
by the smallness than by the largeness of the discrepancies, and
would conclude from his study of the results that important public
examinations, like the School Certificate and university honours, are
marked much more consistently than is usually the case. Instead of
laying their chief emphasis on the extreme discrepancies, the authors
might have pointed out that the median disagreement between any two
examiners is not more than 3% in the best-conducted examinations.
Since this figure represents the Probable Error of an examinee's mark,
a few examinees will naturally show discrepancies four or five times
as large. It represents an average correlation between different
examiners of +0.80, when the standard deviation of their marks is
10%. And decidedly lower figures have been obtained by other
investigators. The authors also omit to point out that the reliability of
an examinee’s final total mark is generally very much higher than 
their results indicate, since it may be based on half a dozen or more 
ог more examiners. Their results mostly 
T two two-hour, papers. 
deserves to be taken very seriously, since 
5 thorough examinations are deplorably 
ence of a scheme of instructions drawn up 
ced examiners, much worse discrepancies 


h € average and the dispersion of marks are 
not standardized, gross differences may appear in the proportions of
Credits, Passes, Fails, etc., which are awarded; and that even a P.E.
of 3% may frequently make all the difference between a Pass and a
Fail, or a First and a Second Class.
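
The figures quoted above (a Probable Error of about 3%, a correlation of +0.80, and a standard deviation of 10%) can be checked against one another with the usual formula for the Probable Error of a mark, P.E. = 0.6745 × S.D. × √(1 − r). The sketch below assumes that this is the relation intended and that errors of marking are normally distributed; the function names are illustrative only.

```python
import math
from statistics import NormalDist


def probable_error_of_mark(sd, r):
    """Probable Error of a single mark: 0.6745 * SD * sqrt(1 - r)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return 0.6745 * sd * math.sqrt(1.0 - r)


def fraction_beyond(multiple_of_pe):
    """Proportion of normally distributed errors larger than k times the P.E."""
    z = 0.6745 * multiple_of_pe  # one P.E. is 0.6745 standard deviations
    return 2.0 * (1.0 - NormalDist().cdf(z))


pe = probable_error_of_mark(sd=10.0, r=0.80)
print(round(pe, 2))          # 3.02
print(fraction_beyond(1.0))  # close to one half, by definition of the P.E.
print(fraction_beyond(4.0))  # fewer than 1 in 100
```

On these assumptions errors of four times the Probable Error are rare but not negligible, which is consistent with the remark that a few examinees will show discrepancies of that size.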
Another example will be quoted which better illustrates the sub- 
nder ordinary school conditions. Boyd (1924)
collected large numbers of essays on ‘A Day at the Seaside’ from
11-year-old Scottish school-children (i.e. children at the qualifying
stage). From these he selected 26 as representing the whole range of
merit, and had them marked independently by 271 teachers. The
marks were not numerical, but consisted of seven grades, ranging 
from Excellent to Unsatisfactory. On comparing the results of 
different markers, it was found that almost every essay received at 
least 6 of the 7 possible grades. The following instance is typical: 3 
teachers marked it Excellent; 31 VG+, 80 VG, 125 G+, 28 G,
and 4 marked it Moderate (the lowest grade but one). Boyd shows 
that part of the discrepancy was due to differing standards. But even 
when the 20% most lenient markers and the 20% most severe were 
excluded, the average essay was still awarded at least 4 out of the 7 
grades. He also points out that the average grade given to an essay 
by a large number of markers is highly consistent, and that the
various divergent individual grades are distributed around this
average more or less in accordance with the normal distribution
curve. Actually it has been shown by Wiseman (1949) that a
reasonably reliable mark for English composition can be obtained at the
11+ selection stage (since the range of ability is very much wider
than it is among highly selected grammar school pupils or university 
students), provided that four examiners’ marks are combined. It is 
desirable also that each candidate should write on two or more 
topics. The burden of marking is not so heavy as might appear, since 
only the essays of some 10 to 20% of borderline candidates need to 
be considered; and rapid general impression marking at the average 
rate of one minute per script is adequate. 
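
The gain from combining several examiners' marks can be estimated with the Spearman-Brown formula, the standard result for the reliability of a sum or average of k comparable measures. The sketch assumes a correlation of 0.60 between any two single examiners, a figure taken purely for illustration, and assumes that all examiners are equally reliable; the text does not state which formula underlies Wiseman's finding.

```python
def spearman_brown(r_single, k):
    """Reliability of the combined mark of k examiners, each pair correlating r_single."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1.0 + (k - 1) * r_single)


# Illustrative single-examiner correlation (an assumption, see above).
r_pair = 0.60
for k in (1, 2, 4):
    print(k, round(spearman_brown(r_pair, k), 2))
# 1 0.6
# 2 0.75
# 4 0.86
```

With four examiners the combined mark is markedly more dependable than any single one, and further examiners add progressively less.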

Other experiments, including one of Hartog’s, indicate that when 
the same examiner re-marks a set of scripts after an interval, the 
discrepancies between his first and second opinions are likely to be 
almost as great as those between two independent examiners. Some 
of the studies by the French members of the International Examina- 
tions Enquiry are also worth quoting. A set of papers in science were 
marked by three trained scientists (one of them doing it twice over), 
and also by a student who had no scientific training. When the mark- 
ings were compared, the average correlation coefficient between the 
scientists was +0.63, whilst the correlations between the student and
the scientists averaged +0.51. The latter figure is lower than the
former, but so little lower that it casts some suspicion on the value of 
an examiner’s knowledge of his subject. 

Particularly striking was a study of the philosophy examination in 
the Baccalauréat. One hundred scripts were marked by six examiners. 
Only 10 of the candidates were passed by all the examiners, only 9 
failed by all of them. The remaining 81 were passed by some, failed 
by others. It is, of course, always the examinees of medium ability 
whose true standing is most uncertain. The few best or few worst are 
recognized as such by all examiners. But only too frequently an 
arbitrary dividing line—the pass mark—has to be erected in the 
middle of this region of maximum uncertainty. Undoubtedly, there- 
fore, a large number of candidates who just succeed, or just fail to 
gain a General Certificate pass or a scholarship might very easily 
obtain the opposite result if they were marked by a different examiner, 
or if they sat an alternative set of papers. In many instances their 
whole educational or vocational career may be blighted owing to 
the unreliability of their marks. 

Reasons for the Subjectivity of Marks.—The prime reason for the 
discrepancies which we have described is that different markers con- 
ceive differently the desirable or undesirable characteristics of exam- 
ination answers. In other words, they attempt to assess many diverse 
types of abilities. One examiner may give chief credit for evidence of 
work done, and for the comprehensiveness and accuracy of the facts 
contained in the answers. Another may look rather for signs of 
future promise, originality, and grasp of general principles. One may 
insist on clarity of expression of ideas, another may try to assess the 
profoundness of the ideas irrespective of their good or bad expression.
Some may search for emotional rather than strictly intellectual
qualities, such as the examinee's interest in the subject. Whenever a
candidate states his attitudes or opinions, these are liable to coincide,
or clash with, the attitudes or opinions of the examiner; and however
desirous the examiner may be of maintaining impartiality, he is liable 
to bias if his pet theories are approved or attacked by the candidates. 
Often, if he would take the trouble to formulate explicitly his notion 
of the aim of the examination, he would find that he means by a 
good examinee the one who closely approximates to himself, and as 
a poor candidate the one who shows none of the intellectual or 
emotional qualities which he himself idealizes. Is it to be wondered, 
then, that different examiners differ in their opinions of a candidate? 

We must recognize that any one script may manifest an extreme
complexity of features, good and bad, and so can be looked at from
a multitude of different points of view. Its merit can never be assessed
in the same way that a physical object is measured. An examiner
ordinarily splits it up into its chief components, notes the facts which
it contains or omits, its elements of originality, style of expression,
the mechanics of its grammar or computations, and so on. He
attaches a certain weight to each of these possible components (a
weight which is likely to be highly individualistic, unless a detailed
scheme of marking has been adopted), and finally re-synthesizes the
candidate's performance on each component into a single mark for
the answer as a whole. Some examiners do not attempt a systematic
analysis of this kind, and merely mark on the basis of a general
impression. But as Ballard (1923) and Hamilton (1929) show, every
general impression involves a more or less vague analysis.

1 This result suggests that the correlation between any pair of examiners
averaged only +0.60.

Even in such subjects as arithmetic, or the translation of foreign 
languages, the weightings assigned to the various analysable features 
of an answer, and the penalties given to various possible mistakes, 
are liable to differ among different markers. But the complexity is far 
greater in the case of English composition, or in those subjects such 
as history, science, etc., where the answers consist chiefly of com- 
positions. The essay-type of answer has been very widely, and justi- 
fiably, criticized. Examinees have to waste a large proportion of the 
available time in translating their historical or scientific knowledge 
into readable essay-form, and in the mere process of writing. This 
latter is particularly unfair, because some can write much faster than 
others without necessarily being better historians or scientists. A 
considerable proportion of the written words (for instance, the 55 
the's, and's, etc.) add nothing to the evidence of the examinee's good 
or poor ability. Investigation shows that the average examinee puts 
into writing less than one fact or idea per minute of examination 
time, since the process of expressing these facts or ideas takes so 
long. The examiner, then, has the still more difficult task of transla- 
ting the product back again, of trying to penetrate through the 
verbiage to the signs of ability and knowledge beneath. Few exam- 
iners can remain entirely uninfluenced by the literary style or the 
handwriting of an answer (there is direct experimental proof of this), 
although they are presumably trying to mark the product for its 
historical or scientific merit, not qua essay. 


INADEQUATE VALIDITY OF EXAMINATIONS 


The poor reliability of examinations greatly reduces their value for 
any and every purpose. For if they cannot even measure examination 
success without serious errors, they naturally cannot validly predict
anything else. Even, however, if perfect reliability were established,
validity would still be diminished by the dependence of examination
success on factors other than ability, knowledge, and industry.

One such factor is, obviously, the efficiency of the examinee’s 
teachers as examination coaches, and their cleverness in guessing the 
kinds of questions which examiners will set. It is manifestly unfair 
that a pupil at one school, where the teachers are specially adept at 
instilling examination knowledge, should succeed when a pupil of 
equal ability fails because he attends another school, where the 
teachers are either less skilled, or more humane in the education 
which they provide. 

There is also the old complaint that some pupils are ‘good
examinees,’ while others become unduly ‘nervous,’ and ‘fail to do
themselves justice.’ One wonders whether it is not usually the weaker
candidates who make this complaint, and whether they would do
better if subjected to any other kind of test, including the tests of real
life. However, we may admit that certain temperamental qualities,
which we can as yet hardly specify, together with good physical
health, may confer some advantage on certain candidates, which 
other candidates lack. 

Examinations, then, certainly fail to fulfil the first two functions 
ly, the measurement of achievement, 
State scholars only one-tenth. But this meant that practically one-half
of those who obtained other types of scholarships (from their
schools or from local authorities) did not justify themselves (cf. 
Valentine, 1938). 

In a further study by the Scottish Council for Research in Educa- 
tion (1936), marks gained by secondary school pupils at the Leaving 
Certificate examination, together with teachers’ marks and the head 
masters’ estimates of the pupils’ general standing, were correlated 
with subsequent class marks and degree marks at the universities. It 
was found that on the whole those who got first-class honours had 
been rated higher by the school head masters than those who got 
second class, and that honours students had been rated higher than 
ordinary degree students; but that, as in Valentine’s research, there 
were a great many exceptions. The actual correlation coefficients 
between school and university marks varied from about +0.30 to
+0.70, and the average of several such results was +0.48. In the
light of our discussion of correlations (Chap. VII), this figure
implies very poor predictive validity.
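
Two conventional ways of judging what a correlation of this size is worth are the proportion of variance in the later marks accounted for by the earlier ones (r²) and the index of forecasting efficiency, 1 − √(1 − r²). Whether these are the indices used in the earlier chapter is an assumption; the sketch simply applies them to the average figure of +0.48.

```python
import math


def variance_accounted_for(r):
    """Proportion of variance in one set of marks predictable from the other."""
    return r * r


def forecasting_efficiency(r):
    """Proportional reduction in error of prediction, compared with guessing the mean."""
    return 1.0 - math.sqrt(1.0 - r * r)


r = 0.48  # average school-university correlation quoted in the text
print(round(variance_accounted_for(r), 2))  # 0.23
print(round(forecasting_efficiency(r), 2))  # 0.12
```

On either index a coefficient of +0.48 leaves most of the variation in university marks unpredicted.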

It should be remembered, however, that the above investigations 
(together with most of the studies of reliability) dealt with highly 
selected groups. At the 11+ selection stage practically the whole 
range of children is examined, and more reliable tests than the 
conventional subjectively marked papers are generally used nowa- 
days. The result is that this examination gives correlations around 
+0.85 with performance in secondary school work 2 to 5 years later
(Emmett and Wilmut, 1952). Moreover, in view of the imperfect
reliability of the school marks used as a criterion of achievement, the
true figure should be of the order of +0.90. Yet even this coefficient
implies that many mistakes are made in selection. If 20 out of 100
are admitted to grammar schools, 5 (that is, one-quarter) are likely
to turn out unsuitable, and 5 who have been rejected and sent to 
modern schools would have surpassed them. 
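
The order of magnitude of these selection errors can be checked numerically. The sketch below assumes that selection scores and later performance follow a bivariate normal distribution with correlation +0.90, and that a pupil counts as suitable if he falls in the top 20% on later performance; both assumptions, and the function name, are introduced here for illustration and are not stated in the text.

```python
import math
from statistics import NormalDist


def fraction_of_selected_who_succeed(r, selection_ratio, steps=4000):
    """P(top fraction on later performance | top fraction on the test),
    for standard bivariate normal scores with correlation r."""
    if not 0.0 < selection_ratio < 1.0:
        raise ValueError("selection_ratio must lie strictly between 0 and 1")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    std = NormalDist()
    cut = std.inv_cdf(1.0 - selection_ratio)  # same cut-off on both variables
    resid_sd = math.sqrt(1.0 - r * r)         # SD of performance given the test score
    upper = cut + 8.0                         # beyond this the tail is negligible
    h = (upper - cut) / steps
    joint = 0.0
    for i in range(steps):
        x = cut + (i + 0.5) * h               # midpoint rule over test scores above the cut
        p_above = 1.0 - std.cdf((cut - r * x) / resid_sd)
        joint += std.pdf(x) * p_above * h
    return joint / selection_ratio


success = fraction_of_selected_who_succeed(r=0.90, selection_ratio=0.20)
misplaced_per_20 = 20 * (1.0 - success)
print(round(success, 2), round(misplaced_per_20, 1))
```

Under these assumptions roughly four or five of every 20 pupils admitted fall outside the top 20% on later performance, which is of the same order as the one-quarter quoted.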

It would be unreasonable to expect perfect correlations between 
marks at successive stages in pupils’ careers, since the subjects 
studied at a secondary school differ distinctly from those studied at 
a primary or preparatory school, even when called by the same 
names. And the conditions of work at a university differ from those 
at a school, being more congenial to some, less stimulating to 
others. Again, every individual changes to some extent as he grows 
older, some of his abilities becoming relatively weaker, new ones
developing. All in all, then, the inability of entrance and scholarship
examinations to choose those candidates who will profit most from 
higher education is not very surprising to the psychologist. Yet their 
validity for this purpose is certainly much poorer than is generally 
supposed by the educational authorities who set the examinations, 
and who base their awards and selections upon them. The universi- 
ties would scarcely continue to dictate the present constitution of the 
General and Leaving Certificates, and Matriculation examinations, 
if they did not believe that these instruments picked out the most 
suitable entrants. 

Still more unjustified is the belief embodied in the fourth of the 
functions outlined above (p. 196), namely that these examinations 
are adequate criteria of vocational fitness. It is indeed true, as em- 
ployers assert, that successful candidates will on the average be 
better in general all-round ability than those who have failed to gain 
the Certificates, since all scholastic abilities are to some extent 
positively correlated with intelligence. It is probably true, also, that 
vocationally desirable temperamental and character qualities may 
aid in examination success. But so do a great many other factors. 
Some pupils with extremely undesirable characters may do well at 
examinations through intellectual brilliance; others through efficient 
coaching on the part of their teachers, or through over-cramming; 
still others through the pure chance factors introduced by the 
subjectivity of marking and other forms of unreliability. The con- 
clusion follows, then, that success or failure may depend on so 
many diverse qualities, that it does not provide trustworthy evidence 
of any of them. The employer would be far better advised to make 
use of the services of a scientific vocational psychologist, who would 
measure or obtain the best available assessments, both of intelligence 
and other aptitudes, of educational achievements and of tempera- 
mental traits, and would give each of these qualities its due weight 
in deciding on the suitability of prospective employees. Undoubtedly 
the adoption of a similar scientific study of each individual candidate 
for secondary or university education would lead to many more just 
selections, and more accurate predictions of future achievement, than 
do the present methods (cf. Dale, 1954). 

Although this chapter has consisted mainly of destructive criticism, 
certain hints have been given as to possible ways and means of 
improving examinations. These will be followed up, and expanded 
in the next chapter and in Chap. XIV. 


CHAPTER XII


RELATIVE ADVANTAGES OF NEW-TYPE AND 
ESSAY-TYPE EXAMINATIONS 


It has been widely claimed that the worst defects of the traditional 
type of written examination may be overcome by adopting in its 
place the objective or new-type test for measuring achievement. 
Such tests came into common use in the United States many years 
ago, but are still little known in this country, in spite of Ballard's 
(1923) able advocacy. Americans, however, realize now that they 
also are liable to certain defects when used undiscriminatingly, and 
that there are still functions which can better be performed by the 
old- or essay-type of examination. Thus we are in a position to profit 
both from their researches and from their errors, to devise our tests 
in accordance with the principles they have already worked out, and 
to apply them to the fields of educational measurement where they 
will be most useful and most valid. 

Now it is clear that some educational products can be marked 
much more objectively and reliably than others. In the early stages 
of school arithmetic, when children are given a series of short sums 
to do, there is no subjective element involved in deciding which ones, 
and how many, they do correctly. But before long arithmetic becomes
more complex; each sum involves a number of different
operations, partial credits are essential, and different markers are
now liable to differ. An English composition, or an examination
answer written in essay form, brings in a still greater personal
element, since it is still more complex and cannot ever be considered
merely from the point of view of right or wrong. The new-type 
tester argues, therefore, let us refrain from setting sums or essay 
questions which involve many operations, and instead ask questions 
which evoke each operation separately, questions to which only one 
right answer is possible. And instead of leaving to the marker's 
personal taste the analysis which must be made of a complex product, 
and the weighting which must be attached to its component ele- 
ments, let the person who sets the questions make the analysis 
beforehand, and decide systematically just how many marks (i.e.
how many questions) shall be allotted to each element.

Advantages of New-type Examinations.—So we arrive at the
new-type examination, which contains a large number of brief questions
instead of a few long ones. Every question is so arranged that all
markers will agree as to the rightness or wrongness of an answer. 
This elimination of the personal element is not the only advantage. 
For the hundred or more questions of the new-type test are capable 
of sampling the whole field of knowledge much more compre- 
hensively than can the half-dozen or so questions of the orthodox 
examination. Candidates can no longer complain that the particular 
questions failed to suit them, since all their strong and weak points 
are probed. This fact should help to reduce the ‘spotting’ and cram- 
ming of likely questions. Again, the subjectivity of standards of 
marking need no longer cause trouble, since the examinees’ marks 
are pure ‘count scores’ (cf. p. 20). So long as the questions are de- 
signed to cover all levels of difficulty evenly, the distribution of marks 
is sure to approximate to normality, and can readily be converted 
into any desired percentage or other scale. The examiner may, of 
course, draw an arbitrary pass line at any point in the distribution 
that he wishes, so as to cut off the examinees who fail to come up to 
his minimum standards. 

A full description of the various sorts of questions which may be 
employed in new-type tests is given in the next chapter. Here it will 
be sufficient to distinguish between the so-called ‘open’ or ‘recall’ 
type of question and the ‘closed’ or ‘recognition’ type. In the former 
do not distort the examiner's marking. It has been claimed that the 
average examinee, instead of expressing less than one idea or fact 
per minute (as in an essay-paper), can deal with three to six new-type 
questions a minute. But this naturally depends on the level of diffi- 
culty of the questions. A further advantage is that the examinee who 
knows scarcely anything about the subject, but who bluffs by
‘free-associating’ around an essay-type question, and who often gains
marks by doing so, is entirely foiled. Older students realize how much 
fairer is the new-type test, and on the whole prefer it to the orthodox 
examination. Younger pupils, also, according to several American 
investigations, enjoy it more. 

A short test is appended below, dealing with the subject-matter of 
the previous chapters in this book, which the reader is advised to 
try to answer. It is not meant to be a complete examination on 
mental testing and statistics (hence it violates several of the rules of 
construction given in the next chapter). But it introduces several of 
the main varieties of new-type questions, some easy, some fairly 
difficult, and should give the reader who is not already familiar with 
such questions a better insight into them. Nos. 1-8 are simple-recall 
type, Nos. 9-16 completion type, Nos. 17-25 true-false type, Nos. 
26-36 multiple-choice type, Nos. 37-47 matching type, and Nos. 
48-54 rearrangement type. The correct answers are listed on 
p. 258. 


EXAMINATION ON MENTAL MEASUREMENT AND STATISTICS 
Time: One Hour 


Please read the following directions carefully before starting: 

Answer as many questions as possible, in any order. The answers
should be written in the right-hand margin on the blank spaces
provided. Any rough figuring may be done on blank paper.

The answer to every question (except Nos. 42-7) consists of one 
letter, word, or number. 

You may guess if you are not certain of an answer; but as marks
will be deducted for wrong answers, random guessing is not
advisable.


1. Eight pupils obtained the following school marks:

     A   B   C   D   E   F   G   H
    15  10  12  10   9  12  10  18

What is their median? ........
2. If the above marks were put into rank order form, what would
B's rank position be? ........

3-5. The following is an extract from the norms for a group intelligence
test:
Score . . 48 58 68 77 8
Mental Age. 9 10 11 12 13 14 15 and adult
What are the respective intelligence quotients of:
3. A child aged 8-0 who scores 77? ........
4. A child aged 10-0 who scores 63? ........


Percentile Mark 

| BY 

38 ~ 67% 

50 60% 

25 53% 

10 . 47% 

What is his approximate place (i.e. his rank position in the 
class) in: 

6. English? 
7. French? 


8. Mathematics? 


9-16. In each of the numbered spaces in the following paragraph, 
one word or letter is needed to complete the statements. Do not 
write anything in the actual spaces, but opposite No. 9 in the 
right-hand margin write the word or letter which is needed to 
fill up space (9). Fill in the other words or letters similarly. 

The compensation theory of mental organization is disproved by the fact
that all correlations between mental abilities are (9). The results of
correlational investigations also disprove the view that the mind is made
up of distinct (10). The Two-factor Theory, proposed by (11), regards each
ability as made up of a factor common to all abilities, called (12), and a
factor (13) to that ability alone, called s. However, when verbal tests,
performance tests, and other tests of intelligence are compared, it is
doubtful whether (14) will account for all the inter-correlations. Residual
correlations indicate the presence of (15) factors, such as verbal ability,
v, and practical ability (16).

 9. ........    10. ........    11. ........    12. ........
13. ........    14. ........    15. ........    16. ........


17-25. Write a + sign in the margin after any of the following
statements which are true, and a − sign after those which are
false.
17. An omnibus test is a group intelligence test
in which items of many kinds are mixed up ........
18. A majority of very dull children are below
average in health and physical development ........
19. Performance tests are applied by psychologists
chiefly in order to measure a child's
practical ability in everyday life ........
20. In order to measure the average intelligence
of a class of 7-year-old children, a teacher
would be best advised to use a group test
with pictorial material ........
21. In a normal frequency distribution curve the
semi-interquartile range and the standard
deviation are identical ........
22. University students who obtain scholarships
on the basis of competitive examinations
as often as not obtain no better degree
marks than those who fail to win such
scholarships ........
23. The advantage of scoring abilities in terms
of Mental Age or Educational Age is that
the units of measurement are all equal to
one another ........
24. A group of a hundred children picked out as
being especially superior in one school
subject will practically never show as great
an amount of average superiority in any
other subject ........
25. The Probable Error of a pupil’s English
composition mark is generally likely to be
smaller than that of his arithmetic mark ........


26-36. Write the letter (A, B, or C, etc.) of the correct answer in the
margin.
26. Group intelligence tests, as contrasted with
the Stanford-Binet test, are: ........
(A) More varied in content 
(B) Better measures of children's intelligence 
(C) More objectively scored 
(D) More difficult to apply 
(E) Less suitable for application to superior 
adults 


27-32. A set of 51 pupils took three examinations, A, 
B, and C. The following distributions of marks 
were obtained : 


Mark A B C
574. zm 4 1 
B0 o o e 3 2 5 
T+ = s lI 4 7 
66 + . . 18 7 п 
59 + . e 42. "d 11 
52+ . vc S 14 8 
454. M. 9 4 
А Меж. xod 3 2 
51 50 ï 


27. Which of the distributions approximates
fairly closely to normal, A, B, C, or
None? ........

28. Which of the distributions is negatively
skewed, A, B, C, or None? ........

29. Which of the distributions is likely to
have the smallest standard deviation? ........

30. A certain pupil happened to get 70% on
all three examinations. In which of
them did he do relatively best? ........

31. If you were calculating the average of
Distribution C, what figure would you
take as your arbitrary mean? ........

32. In calculating this same average, what
figure would be multiplied by 2 (i.e.
by the number of pupils who score
38+)? ........

A group intelligence test was applied in a 

secondary school, and the occupational 
preferences of all the pupils were ascer- 
tained and classified under ten headings. 
In order to find whether there is any cor- 
respondence between occupational prefer- 
ence and intelligence, which of the follow- 
ing statistical techniques could best be 
applied? : Я Е 

(A) Biserial r. 

(B) Product moment correlation. 

(C) Analysis of variance. 

(D) Contingency coefficient. 

(E) Correlation ratio. 

(F) Multiple correlation. 

Which is the least serious of the following 


defects in intelligence quotients obtained 
from group intelligence tests? . : 
(A) Group test scores are considerably affected
by previous practice on similar tests.

(B) The standard deviation of group test
I.Q.s differs considerably in different
tests.

(C) Few of the published tests are adequately
standardized, or have reliable norms.

(D) Group tests are generally given with a
time limit, and the child who is quickest
is not necessarily the most intelligent.


35. Which one of the following statements is untrue?
The distribution of marks given by
an examiner to a class of fifty pupils may
be very irregular or skewed because:

(A) The number of cases is too small for a
smooth, normal distribution to be
expected.

(B) The examiner's opinion of the relative
merits of the scripts is a personal one.

(C) The pupils’ abilities may be asymmetrically
distributed, as when, for example,
the class contains a long tail.

(D) The scale of marks which the examiner
adopts may be lacking in any rational
basis.


36. In an experiment on the scores of graduate
and non-graduate teachers on verbal and
non-verbal group intelligence tests, these
results were obtained:

                            Verbal Tests    Non-verbal Tests
Graduates . . . . . .          153.9             78.6
Non-graduates . . . .          143.5             77.0
P.E. of the difference
between means . . . .            2.1              1.2

Which of the following conclusions may legitimately
be drawn? ........

(A) Graduates do better than non-graduates
at intelligence tests.

(B) All teachers do better at verbal than at
non-verbal tests.

(C) Scores on verbal tests are improved by
taking a university degree.

(D) Teachers with university training have
no appreciable advantage over others
on non-verbal tests.
[Figure: four pairs of frequency curves, (A) to (D), each plotting frequency (F) against I.Q.]


37-41. The above pairs of curves (A-D) represent the approximate 
frequency distributions of the intelligence quotients of various 
groups of children. Below are listed several such groups.
Opposite four of them write the letter of the pair of curves
which corresponds. Write ‘None’ opposite the groups to which
no pair of curves corresponds. 

37. One thousand elementary school pupils, and
one thousand special school pupils ........
38. All elementary school pupils and all special
school pupils in the count ........
39. All elementary school pupils and all children

this piece of work. Two letters may be written if two psycholo-
gists were prominently associated. Some letters may be used 
twice or more, some not at all. 


(A) Ballard
(B) Binet
(C) Burt
(D) Drever-Collins
(E) Terman
(F) Thurstone

42. Development of the conception of the Intelligence Quotient ........
43. Development of a scale of performance tests ........
44. Investigations by correlations into the factors underlying abilities ........
45. Advocacy of new-type examinations ........
46. Construction of educational attainments tests ........
47. Re-standardization of the Binet-Simon scale ........


48-54. An arithmetic test, a handwriting test, and two group 
intelligence tests were given to an ordinary school class, and 
the following correlations between test scores were worked out: 


(A) Combined intelligence tests with handwriting test. 
(B) Combined intelligence tests with arithmetic test. 
(C) Combined intelligence tests with chronological age. 
(D) One intelligence test with the other. 

(E) Arithmetic test with handwriting test. 

(F) Handwriting test with chronological age. 

(G) One intelligence test with arithmetic test. 


Which pair would you expect to yield: 


48. The highest correlation? ........
49. The second highest? ........
50. The third? ........
51. The fourth? ........
52. The fifth? ........
53. The sixth? ........
54. The lowest correlation? ........


DEFECTS OF NEW-TYPE EXAMINATIONS 

So far only the advantages of new-type examinations have been 
mentioned, and we must now consider criticisms of them in detail. 
Two minor objections will be admitted at once. First, the cost of 
printing or duplicating a new-type test is considerably greater than 
that of an essay-type examination. For classroom purposes, however, 
oral presentation of the questions can often be adopted. If several 
groups of pupils or students are to be examined, they can write their 
answers on blank sheets, and their test papers can be used several
times over. 

Secondly, much greater time and skill are needed for constructing 
a good new-type test. The major criticisms which are to be discussed
below apply only too often to inferior tests which have been rapidly 
constructed by inexperienced amateurs. It is highly desirable that 
teachers should be trained in the principles of testing during their 
training college courses. The time needed in setting a paper is, how- 
ever, more than offset by the time saved in marking, when the num- 
ber of examinees is large. Indeed, since the marking should be fool- 
proof, there is no reason why it should not be done by clerks or 
secretaries. In large public examinations, the authorities could save 
much expense by paying professional examiners only to set, not to
correct, the papers. The present writer finds that the construction of 
a one-hour's new-type test in educational psychology (such as that
appended above) takes him about twenty hours. Setting an essay- 
examination scarcely needs one hour. But in marking these one- 
hour examinations he can easily do thirty of the former as against 
ten or less of the latter scripts per hour. He, therefore, finds no saving 
in total hours of work unless the number of examinees approximates 
to 300 or more. But the setting of a new-type test is a fascinating 
occupation, which can be done in odd moments throughout the 
year; and the marking is simply a routine matter which involves no
mental strain. By contrast, the marking of large numbers of essay-type
scripts in psychology is the most trying work that he ever has to do. 
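
The arithmetic behind the figure of about 300 examinees can be checked with a short sketch. It uses only the rates quoted in the paragraph above (twenty hours against one hour for setting, thirty against ten scripts per hour for marking); the function and constant names are illustrative and not taken from the text.

```python
from fractions import Fraction

# Figures quoted in the text for a one-hour paper.
SET_NEW_TYPE_HOURS = 20      # hours needed to construct the new-type test
SET_ESSAY_HOURS = 1          # hours needed to set the essay examination
MARK_NEW_TYPE_PER_HOUR = 30  # new-type scripts marked per hour
MARK_ESSAY_PER_HOUR = 10     # essay scripts marked per hour ("ten or less")


def total_hours(setting_hours, scripts_per_hour, examinees):
    """Setting time plus marking time for a given number of examinees."""
    return setting_hours + Fraction(examinees, scripts_per_hour)


def break_even_examinees():
    """Number of examinees at which both kinds of paper cost the same time."""
    extra_setting = SET_NEW_TYPE_HOURS - SET_ESSAY_HOURS
    saving_per_script = (Fraction(1, MARK_ESSAY_PER_HOUR)
                         - Fraction(1, MARK_NEW_TYPE_PER_HOUR))
    return extra_setting / saving_per_script


if __name__ == "__main__":
    print(break_even_examinees())  # 285
    for n in (100, 300, 500):
        new_type = total_hours(SET_NEW_TYPE_HOURS, MARK_NEW_TYPE_PER_HOUR, n)
        essay = total_hours(SET_ESSAY_HOURS, MARK_ESSAY_PER_HOUR, n)
        print(n, float(new_type), float(essay))
```

On these rates the two methods cost the same at 285 examinees, which agrees with the statement that no time is saved until the number approaches 300. If essay marking runs slower than ten scripts an hour, the break-even point falls correspondingly.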

The Guessing Factor.—Next we must admit that examinees may
obtain a certain number of correct answers (in a recognition-type
test) by pure chance guessing. If they filled in answers at random
they would, on the average, achieve half-marks on the questions
where two responses are provided (such as the true-false type), and
quarter-marks on the questions where four responses are provided.
This fact disturbs the teacher far more than it does the psychologist.
For the latter realizes that every examinee has the same chance of
scoring, let us say, 35%, by random guessing; he therefore regards
35% as the effective zero point and only gives credit to scores above
this level. An alternative procedure, which comes to much the same
thing, but is fairer in that it does not penalize the examinee who omits
questions rather than guess them, is to subtract the total wrong
responses in true-false questions from the total right ones; to subtract
half the total wrong in three-response questions, one-third in
four-response questions, and so on. That is to say, an examinee's

score corrected for guessing = $R - \dfrac{W}{n - 1}$,

where R is the total right, W the total wrong, and n is the number of
alternative responses provided in each question. In general practice
the correction is omitted in four- or more response questions, since
it is so small.
But the problem of guessing is a complex one which cannot be 
entirely settled by this mechanical correction. It has aroused wide- 
spread controversy, and a long discussion of it may be found in Ruch 
(1929). Although the above correction will compensate for the 
effects of guessing in the average examinee, yet it will not do so in 
all cases, since some will be luckier than others, in accordance with 
the principles of probability. Suppose that a large group of persons 
guesses the responses to ten true-false items; the average person will
get 5 of them right, and two-thirds of the group will get either 4, 5,
or 6 right.² Only about one person in a thousand will be so lucky as
to get 10 right, or so unlucky as to hit on none of the right answers.
If the guessing correction is applied, the scores corresponding to 10,
6, 5, 4, and 0 right will be +10, +2, 0, −2, and −10 respectively.
In a longer test the range of chance scores which may arise from 
guessing will be still greater, although their average (when corrected) 
will still be zero. Undoubtedly, then, we must expect true-false tests 
to be somewhat unreliable, but the effect will be less marked in tests 
composed of three- or four-response items. Experimental evidence 
does indeed show true-false tests to be less reliable than multiple- 
choice tests; and the latter are less reliable than tests made up of 
recall items. But the difference between them is not large, and it may 
easily be offset by the fact that a true-false test can contain more items 
than a multiple-choice test which occupies the same length of time, 
since true-false items can be answered more quickly. Probably a 
good deal depends, also, on the skill of the tester in setting each 
particular type of question. 
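
The figures in this paragraph can be reproduced with a minimal sketch, assuming that every item is answered by a pure guess and that each alternative is equally likely to be chosen. The correction is the one stated above, R minus W divided by (n minus 1); the function names are illustrative only.

```python
from fractions import Fraction
from math import comb


def corrected_score(right, wrong, n_alternatives):
    """Score corrected for guessing: R - W / (n - 1)."""
    return right - Fraction(wrong, n_alternatives - 1)


def chance_of_exactly(k, items, n_alternatives=2):
    """Probability of exactly k correct answers when every item is guessed."""
    p = Fraction(1, n_alternatives)
    return comb(items, k) * p**k * (1 - p) ** (items - k)


if __name__ == "__main__":
    # Ten true-false items, none omitted, so wrong = 10 - right.
    for right in (10, 6, 5, 4, 0):
        print(right, corrected_score(right, 10 - right, 2))

    # Share of pure guessers who get 4, 5, or 6 right (about two-thirds).
    middle = sum(chance_of_exactly(k, 10) for k in (4, 5, 6))
    print(float(middle))  # 0.65625

    # Chance of getting all ten right by luck (about one in a thousand).
    print(chance_of_exactly(10, 10))  # 1/1024
```

The corrected scores come out as +10, +2, 0, −2, and −10, as stated. Passing a larger value for `n_alternatives` gives the chance distribution for three- or four-response items, where lucky high scores are rarer.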

Now, though we must allow that a considerable error may occur 
among the scores of a small proportion of examinees on account of 


1 Note that there is no point in making any guessing correction when all
examinees attempt an answer to every question. Their relative marks will be
affected by the correction only when they omit some questions. In general, then,
it becomes less necessary the longer the time allowance for the examination.

2 The situation is effectively the same as getting the persons each to toss a
coin ten times, and counting up how many of them obtain 0, 1, 2, ..., 9, 10
heads.
guessing, in actual practice entirely random guessing (which was
assumed in the previous paragraph) will very seldom occur. The 
examinee is more likely to possess an incomplete knowledge of 
certain items, sufficient, however, to make his guesses right ones. 
Teachers may regard it as unfair that such an examinee will obtain 
the same marks on those items as will the more able examinee who 
really knows the right answers. But a brief study of recognition-type 
questions, such as those given above, will show that the examinee's 
shaky knowledge may not always be to his advantage. For the 
alternative (wrong) answers are usually worded sufficiently plausibly 
to deceive the weak student, and should indeed be based on common 
misconceptions. Hence his ‘semi-guesses’ will in practice be very 
frequently wrong. For this reason experimental researches indicate 
that the guessing correction fully compensates, or even over- 
compensates, for any advantages which the weak student might gain 
by guessing. Other investigations prove that the amount of guessing 
may be greatly reduced, and the reliability and validity of the test 
slightly increased, by instructing examinees not to guess, and by 
warning them that wrong guesses will be penalized.

To sum up: random guessing may considerably distort the marks 
on a recognition-type test, especially when the questions provide 
only two alternative answers. It does not actually do so to any great
ever is random, partly be- 
ructions, and partly because 
The Elementaristic Tendency in New-type Tests.—We come now to
the most obvious and most frequent objection to new-type tests. It 
is said that they measure only the most trivial aspects of ability, such 
as details of information. The tests may indeed analyse complex 
mental operations into elements which can be objectively marked, 
but they cannot put the elements together again, and in the process 
of analysis they omit many of the most important aspects of ability. 
They cannot, as can the essay-type of examination, show the ex- 
aminee's general understanding of the subject, nor his interpretation 
of facts, his capacity for organizing and formulating his knowledge, 
nor his initiative and originality. Far from reducing the tendency of 
examinees to cram mechanically, they may increase the amount of 
rote memorization and discourage any desire to achieve a general 
grasp of the field. Investigation has, indeed, shown that students revise 
their work in a different manner when they know they are to sit a 
new-type examination, and that they concentrate on acquiring such 
bits of knowledge as are most likely to appear in the test questions. 
It is often alleged, though without proof, that secondary pupils who 
have been trained in the primary schools to pass objective selection 
tests find considerable difficulties in adjusting to more normal 
forms of school learning. 

Such criticisms embody a number of half-truths, and may be 
countered by a number of arguments. The retort may first be made 
that even if the new-type test cannot measure ‘understanding, 
originality, etc.,’ the essay examination cannot do so either. For 
when examiners come to assess such qualities, they disagree widely 
with one another. The orthodox examination achieves a reasonable 
degree of reliability when it most closely approximates to a new-type 
test, i.e. when it asks primarily for factual data, or when the ex- 
aminers draw up a detailed analytic scheme of the points which they 
wish to mark. Secondly, it may be pointed out that examinees en- 
gaged on an essay examination do not usually shine in regard to 
organization of knowledge and original thinking. Rather they dash 
down all the ideas and facts which they can recall, many of them 
irrelevant. ‘Thinking’ as contrasted with ‘reproduction of items of 
information’ is at a minimum. The new-type test taps all the relevant 
knowledge, with much less waste of time, and prevents the candidates 
from producing irrelevant material. 

Such recriminations are also, however, only partially justified. 
Much sounder is the argument which points to weaknesses in the 
critics’ psychological analysis of the contrasted examinations. They
commonly draw a sharp distinction between information and repro- 
duction, on the one hand, and understanding and thinking, on the 
other hand. They assume, that is, that these mental processes are 
uncorrelated. In actual practice the two cannot be separated so 
readily. The acquisition and reproduction of information always 
involves a certain amount of thinking, and no thinking is possible 
unless a person possesses information to think about. Hence, 
experiments in which attempts have been made to measure these 
faculties separately have usually shown them to be highly inter- 
correlated. It follows, therefore, that even when a new-type test is 
apparently measuring nothing but information, it is at the same time 
providing a pretty good measure of other more complex types of 
ability. It cannot ask examinees to ‘Discuss . . ., Compare . . ., etc.,’
nor make any overt provision for independent thinking, yet it is 
certainly drawing to a considerable extent upon these ‘higher’ mental 
processes which the orthodox examiner desires to assess. Conversely,
a large proportion of the average essay-type answer consists of the
sort of material which critics affect to despise when it is accurately
measured by new-type tests.

The overlapping between the two kinds of mental processes is 
further demonstrated by the fact that a new-type response, or a sen- 
tence in an essay answer, may depend upon ‘understanding’ in one 
examinee, but may result from ‘facile reproduction’ in another 


n more opportunity of displaying ‘higher’ 
mental processes in essay-type than in most new-type examinations. 
And we must remember that, even if they do not 


bits of information are much the easiest to set, and, therefore, do
only too frequently constitute the bulk of home-made examinations.
But there is nothing inherent in new-type examining which makes 
better questions unattainable. And the writer would claim that at 
least half the questions in his own test, above, do demand not only 
the possession of information, but also the application of principles 
or the interpretation of information; that is to say, ‘understanding.’ 

To conclude, then, the defect which we have been discussing is one 
which may occur in new-type tests, but need not do so to any great
extent. Its effect upon the validity of the tests has been greatly ex- 
aggerated by critics owing to their faulty views of the nature and 
organization of mental abilities. 

Defects Due to the New-type Medium of Expression.—Our next 
objection is of a different order. It does not try to point out some- 
thing which new-type tests fail to measure, but rather to show that 
they measure a good deal over and above the abilities which we 
actually want to assess; and that this extra factor, having little 
educational significance, seriously distorts and reduces the validity 
of all new-type test results (cf. pp. 144, 168). Some of the most ardent 
advocates of new-type testing, e.g. Ballard (1923), do not seem to 
have taken this criticism into account. An excellent account of it may 
be found in Hawkes and Lindquist et al. (1937).

If the reader will analyse carefully the processes which take place 
in his own mind when answering the new-type questions above, he is 
likely to find a good deal going on besides the recollection and 
application of his knowledge of the subject. He must first study the 
instructions to discover what he has to do in each question, and then 
manipulate his knowledge in such a way as to fit it in with the requirements
of the question. In a multiple-choice question he may arrive
at the right response, not because he is certain that he knows it, but 
because he has eliminated the incorrect ones. Sometimes a single 
word in the question or in the provided response will enable him to
choose correctly, without his having to think out the full implications 
of all that is printed. Some questions may, unknown to the author, 
contain clues to the correct responses which do not require any 
knowledge whatsoever of the subject. But if the reader is a sufficiently 
sophisticated new-type examinee, he will at once take advantage of
such faults in construction. Examples of these irrelevant clues are
given in the next chapter. Unfortunately, the questions which are 
most liable to contain such errors are those that aim to evoke 
‘understanding’ rather than ‘reproduction.’ Straightforward questions
which ask primarily for factual information (e.g. Nos. 27-33
and 42-47) are generally free from them. But more elaborate questions
(e.g. Nos. 34-41, 48-54) involve a great deal of ‘disentangling’
and complex inference. The ability to do such questions may depend 
rather largely on the examinee's previous experience and practice 
with such tests, or upon some unknown group factor, and not 
solely on his knowledge of the subject being tested. 

Such sources of unreliability and invalidity may be much reduced 
if the examiner is sufficiently skilled in anticipating all the diverse 
mental processes which examinees may employ, but they are likely 
to be eliminated only if he falls back on questions of the simple- 
recall type, and asks merely for scraps of information. Actually no 
examiner can foresee all the possible weaknesses in his questions, for 
investigation shows that even tests constructed by experts may con- 
tain items whose validity is very low, or sometimes negative. That is, 
these items are done no better, and sometimes worse, by good than
by poor students. A difficult multiple-choice item, for example, may 
contain wrong alternatives which are sufficiently plausible to deceive 
the majority of good students, but which are beyond the compre- 
hension of the poor students, who consequently just guess, and often 
guess rightly. Or the correct answer may contain some unsuspected 
ambiguity which is not noticed by mediocre students, who therefore 
choose it, but which is noticed by good students, who therefore reject 
it. Such faults in construction are still more likely to occur in tests 
devised by amateurs.

It was shown in the previous chapter that many of the defects of 
the orthodox examination derived fr 
to formulate or express their 
medium, namely essays, 
pretation and evaluation 


-type examinations. Examinees 
ability in a different but perhaps 
‘roundaboutness,’ and the com- 


‚ the examinee has far more reading to do. 
The Subjectivity of New-typ 
tests, which has been entirely n 
that they are, in a sense, just as subjective as essay-type examinations.
Only the personal element enters, not in the marking, but in the 
setting of the questions. Pullias (1937) has demonstrated that when 
two or more examiners construct their own new-type test papers, 
intending to cover precisely the same field of knowledge and ability, 
and apply them to the same pupils or students, the average correlation
between the two sets of marks is only of the order of +0.50.
This figure may be unduly low because the examiners were teachers
who may not have been highly skilled in test construction. Yet some
thirty-five different teachers took part in the experiment, all of whom 
habitually employed new-type tests, and the results were very similar 
in different school subjects, and among pupils of various ages. Pullias 
further found that standardized educational tests, which were alleged 
to measure ability in the same school subjects, only yielded an 
average inter-correlation of +0.68. In some instances the coefficients
were decidedly higher (+0.8 to +0.9), in others very much
lower (+0.2 to +0.3).
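
For readers who wish to see how such a coefficient is obtained, the following is a minimal sketch of the product-moment correlation between the marks which the same pupils earn on two examiners' papers. The marks shown are invented for illustration and are not Pullias's data; the function name is likewise an assumption.

```python
from math import sqrt


def product_moment_r(marks_x, marks_y):
    """Product-moment correlation between two sets of marks for the same pupils."""
    if len(marks_x) != len(marks_y) or len(marks_x) < 2:
        raise ValueError("need two equally long lists of at least two marks")
    mean_x = sum(marks_x) / len(marks_x)
    mean_y = sum(marks_y) / len(marks_y)
    # Sums of cross-products and of squared deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marks_x, marks_y))
    sum_xx = sum((x - mean_x) ** 2 for x in marks_x)
    sum_yy = sum((y - mean_y) ** 2 for y in marks_y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("marks must vary within each set")
    return sum_xy / sqrt(sum_xx * sum_yy)


if __name__ == "__main__":
    # Hypothetical marks of eight pupils on two examiners' new-type papers.
    paper_one = [62, 55, 71, 48, 66, 59, 80, 52]
    paper_two = [58, 61, 64, 50, 72, 49, 70, 63]
    print(round(product_moment_r(paper_one, paper_two), 2))
```

A coefficient near +1 would mean that the two papers rank the pupils almost identically; a value of the order of +0.50 indicates that a pupil's standing depends to a considerable degree on whose paper he happens to sit.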

How may we account for such results? Actually the sources of un- 
reliability described on pp. 200-7 apply here, too. These were: 

1. Actual changes in the examinees, or function fluctuation. 

2. Inadequate sampling. As already mentioned (p. 212), this 
should be much less prominent in new- than in essay-type examinations,
especially if the instructions given at the beginning of the next
chapter are followed. The trouble is not so much lack of thorough- 
ness of the questions as the subjectivity of their selection. 

3. Statistical defects. These can arise, not because different exam- 
iners adopt different scales of marks, but because the average level, 
and the range, of difficulty among their questions may differ. Such 
inconsistencies can, however, readily be eliminated, and they did not 
affect Pullias's correlations. 

4. Subjectivity. The essay-examination is subjective because the 
examiner has to analyse the field of ability and decide which aspects 
are most important, and his personal opinion determines which 
characteristics of the examinees shall be assessed as good or bad. The 
new-type examination is subjective because the examiner has to per- 
form a similar process of analysis in choosing questions, and in pro- 
viding the good and bad answers between which the examinees are 
to discriminate. Again, the essay-examiner is unable to maintain 
objectivity because he must translate the answers presented to him, 
or ‘penetrate through the medium of expression’ in order to discover 
the examinees’ real merit. The new-type examiner carries out an
equivalent translation when formulating his questions. Hence differ- 
ent examiners will formulate questions about a given topic differ- 
ently, in just the same way that different essay-examiners will trans- 
late differently the answers to an essay-question on this topic. 

Pullias points out that the main stages in examining, namely (a) constructing
questions, (b) answering questions, and (c) evaluating answers,
in reality constitute a single whole. The new-type examiner has, indeed,
eliminated subjectivity from (c), but at the cost of greater subjectivity
in (a). And his personal opinion has inserted itself into (b),
since his questions embody a good deal of what, in essay-examina- 
tions, is normally carried out by the examinees. 

Now that we recognize this subjectivity as the chief flaw in new-type
tests, we should be able to control and reduce it more readily
than in essay-examining. Nevertheless, we must certainly conclude
that both types of examination are imperfect mental measuring instruments,
and open to much the same objections. Conclusive
evidence as to their relative validity is not easy to obtain, since usually 
it is only possible to compare their marks with the marks on sub- 
sequent, equally fallible, examinations. But by pooling the marks 
from several papers of both types, from class teachers’ estimates, etc., 
a fairly good criterion may be achieved. And according to this,
neither the essay-type nor the new-type is definitely superior (cf. 
Lee and Symonds, 1933). The evidence of the superior validity 


ward is quite uncon- 


cel one another out, 
that of either of the 
lastic subject may be 
5 by the other, each 
is best adapted. 


CHAPTER XIII 


CONSTRUCTION OF NEW-TYPE EXAMINATIONS 


THE first stage in the construction of a new-type examination should 
be the preparation of a detailed statement of what the examination 
is intended to measure. A list of the topics which it is to cover should 
be written out, and a rough assessment made of their relative impor- 
tance, for example on a 1 to 5 scale. Suppose that the sum of all these 
assessments comes to 75, and that the examination is to last two 
hours. Probably about 150 questions will be needed. In that case,
each topic assessed as 1 in importance should eventually be covered 
by two questions, each assessed as 5 by ten questions, and so on. 
Naturally this distribution of questions need not be enforced mech- 
anically; but the principle of allotting most questions to the most 
significant parts of the field is important. If the test is designed simply 
to show the acquaintance of students with a particular textbook, the 
questions dealing with each topic may be proportioned to the number 
of pages on the topic in the textbook. 
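
The proportional allotment just described can be sketched as follows. The syllabus of twenty-five topics is hypothetical, chosen only so that the assessments sum to 75 as in the example above; the function name is illustrative.

```python
def allocate_questions(importance, total_questions):
    """Share out questions in proportion to each topic's importance rating."""
    total_importance = sum(importance.values())
    per_point = total_questions / total_importance
    # Rounding means the counts may not add up exactly to total_questions
    # when the ratio is not a whole number; adjust by hand in that case.
    return {topic: round(rating * per_point)
            for topic, rating in importance.items()}


if __name__ == "__main__":
    # Hypothetical syllabus: five topics at each rating from 1 to 5 (sum 75).
    importance = {f"topic {i + 1}": 1 + i % 5 for i in range(25)}
    plan = allocate_questions(importance, total_questions=150)
    print(plan["topic 1"], plan["topic 5"])  # 2 10
    print(sum(plan.values()))                # 150
```

As the text observes, the resulting counts are a guide and need not be enforced mechanically.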

We have already stressed the point that questions should not deal 
with trivial details merely because these are the easiest to set. It is 
very unwise also for teachers to take sentences straight out of text- 
books and turn them into true-false or completion items. If this is 
done, the examinees will assuredly begin to study their textbooks 
with an eye to such sentences, and will try to learn them off by rote 
rather than to understand them. Textbook sentences may occasion- 
ally be adapted as wrong answers in multiple-choice questions, since 
they may then trap the parrot-learner. 

When the teacher and the examiner are one and the same person, 
suitable questions, or ideas for questions, are likely to occur to him
while he is conducting the work of the class. These should be noted 
down and filed; they will save him much time and effort when the 
examination has to be made up. No examiner should expect to be 
able to devise a complete test at one sitting, unless it is intended to 
last only about a quarter of an hour. He will soon discover for him- 
self his speed of work, and will set aside sufficient hours, on a number 
of different days, to enable him to carry out the construction 

efficiently. The time needed will, of course, be much reduced if he 
can take over some of the questions from previous papers which he has
set in the same subject. This is often possible since the papers are 
usually collected at the end of the examination, and not allowed to 
circulate freely round the school or university. Before he enters upon
the final stage of construction, he should have in hand roughly double 
the number of questions he is likely to need. Many will have to be 
rejected for one reason or another, hence he requires plenty of spare 
ones.

To reduce the subjective element in setting, it is desirable for 
several examiners to co-operate in supplying questions. In an institu- 
tion or department where examinations are frequently set to large 
numbers, it is useful to build up a pool of questions from which a 
number of new-type papers can be drawn as the need arises. Each 
question is recorded on a card in a classified file, with details as to 
the examinations in which it has been used, and any data regarding 
its difficulty and validity. Periodically, questions that have been used 
too often are withdrawn, and fresh ones added to keep up with any 
alterations in the course of instruction. 

The choice of questions for any one paper is governed by quite 
a different conception of difficulty from that which prevails in traditional
examinations. In accordance with the principles given in
Chap. II, the test should be so arranged that the poorest examinees 
will obtain marks very little above 0%, the best ones very little
below 100%, and so that the average will be as near as possible to
50%. (These figures may then later be converted into some more
conventional scale.) A few questions, therefore, must be so easy
that almost all (say 95%) can manage them, a few so difficult 
that scarcely any (say 5%) can manage them. Moreover, all inter- 
mediate grades of difficulty must be adequately represented. By 
contrast, all the questions in an ordinary examination are usually 
arranged to be roughly equivalent in difficulty. It follows that 
in a new-type examination, several sections of the work may be 
omitted altogether if they are likely to be known so well that more 
than 95% of the students would answer questions on them correctly. 
It is a pure waste of time to include items which everyone can do.
Conversely, sections of the work may have to be omitted if even the
best 5% are unlikely to be able to answer questions dealing with
them; such questions would also be useless.

Thus the examiner’s next step must be to grade each of his tentative
questions for difficulty, i.e. to assess what proportions of examinees
are likely to get each one right. The grading may be quite coarse, e.g. 
A = 5 to 20%, B = 21 to 40%, C = 41 to 60%, D = 61 to 80%,
E = 81 to 95%. These estimates are likely to be highly inaccurate at 
first, and to diverge widely from similar estimates made by other 
examiners. But if he takes the trouble to check his predictions after 
each paper, and notes how many examinees did actually pass each 
item, his capacity will soon improve. The final version of the paper 
should contain approximately equal numbers of A, B, C, D, and E 
items.¹
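The checking of predicted against actual difficulty can be illustrated by a short Python sketch. It is offered only as an illustration: the function names are hypothetical, and the decision to return no grade for items passed by fewer than 5% or more than 95% of examinees is an assumption drawn from the remark that such items are useless.

```python
def difficulty_grade(percent_passing):
    """Return the coarse grade A to E for the percentage of examinees passing an item.

    Bands follow the text: A = 5-20%, B = 21-40%, C = 41-60%,
    D = 61-80%, E = 81-95%. Percentages outside 5-95 return None
    (an assumption: the text treats such items as not worth including).
    """
    bands = [("A", 5, 20), ("B", 21, 40), ("C", 41, 60), ("D", 61, 80), ("E", 81, 95)]
    p = round(percent_passing)
    for grade, low, high in bands:
        if low <= p <= high:
            return grade
    return None


def check_predictions(predicted, passes, n_examinees):
    """Compare the examiner's predicted grades with those observed after the paper.

    predicted: dict mapping item -> predicted grade ("A" to "E")
    passes:    dict mapping item -> number of examinees who passed the item
    Returns a dict mapping item -> (predicted, observed, agreed).
    """
    report = {}
    for item, guess in predicted.items():
        observed = difficulty_grade(100 * passes[item] / n_examinees)
        report[item] = (guess, observed, guess == observed)
    return report
```

An examiner who keeps such a record after each paper can see at a glance which of his estimates were wide of the mark.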

What particular form of new-type question to adopt depends en- 
tirely upon the subject-matter. Some things are more readily ex- 
pressed in one form, some in another. No certain evidence of the 
superiority or inferiority of any type is as yet available. Experiments
on this matter have been carried out; but the skill of the authors of 
the items has not been controlled. Suppose, for example, that mul- 
tiple-choice items are found to be more reliable and valid than true- 
false ones, this may merely show that the experimenter is himself 
better at compiling multiple-choice items. Nor can we definitely 
claim that different types measure distinctive abilities. The correla- 
tions between two multiple-choice tests, or two true-false tests, do 
not appear to be higher than those between a multiple-choice and a 
true-false test. This indicates that they both measure the same thing. 

There may, however, still be some advantage in probing the mind 
by a variety of techniques. And several different types may be em- 
ployed in a long paper so as to reduce monotony. In a short paper 
this is undesirable, since each type needs to be prefaced by full in- 
structions, and examinees may have to spend too large a proportion 
of time in reading the instructions, and in readjusting their ‘mental 
set.’ The following rule may be taken as a rough, though not a rigid, 
guide. If matching items are included, there should be at least five of 
them, each including 5 to 10 responses. If multiple-choice items are 
included, there should be at least ten of them. If true-false, simple 
recall, or completion items are included, there should be at least 
twenty of each. The numbers of other types of questions should be 
comparable to these figures. 


1 If, however, the object of the examination is to select or cut off, say, the best
20%, or the poorest 30%, then maximum efficiency will be obtained by aiming
as many questions as possible at these respective levels of difficulty, and not
worrying about their suitability for spreading out the marks of the majority
(cf. p. 31).

When the questions have all been chosen they should be rearranged
so that all those of any one type are grouped together. In each 
group the questions should be arranged in estimated order of 
difficulty, starting with the easiest. If this precaution is omitted, the 
poor and medium examinees may waste a lot of time attempting 
difficult items at the start, and fail to reach easier ones which they 
could do, and the reliability of their results will be reduced. 
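The rearrangement just described (items grouped by type, and within each group placed in estimated order of difficulty, easiest first) may be sketched in Python as follows. The field names `type` and `percent_passing` are hypothetical, and the order of the groups is simply taken to be that in which each type first appears, which the text does not specify.

```python
def arrange_paper(items):
    """Group items by type, then order each group from easiest to hardest.

    Each item is a dict with at least:
      "type"            - e.g. "true-false", "multiple-choice" (hypothetical labels)
      "percent_passing" - the examiner's estimate of the percentage who will pass
    A higher estimated percentage means an easier item.
    """
    type_order = []
    for item in items:
        if item["type"] not in type_order:
            type_order.append(item["type"])

    arranged = []
    for kind in type_order:
        group = [it for it in items if it["type"] == kind]
        group.sort(key=lambda it: it["percent_passing"], reverse=True)
        arranged.extend(group)
    return arranged
```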

We will now consider each of the main types. 


SIMPLE RECALL AND OPEN-COMPLETION TYPE 


A question or short problem is followed by a blank space where 
the answer is to be written. Alternatively certain words or short 
phrases are omitted from a sentence or paragraph, leaving blank 
spaces to be filled in. For example:¹

1A. Who was the inventor of the steam engine? ........

1B. The inventor of the steam engine was ........

2. If y = 2x + 3, what is dy/dx? ........

3. Ohm’s Law states that ........ × Resistance = ........

4. When a salt is dissolved in water the molecules break up
into ........ A positive electric charge is carried by the ........,
a negative charge by the ........ A balanced action takes place
whereby XY ⇌ ........ The ........ theory provides an explana-
tion of the ........ of electricity by salt solutions.

5. A geography test may consist of a map on which numbers, 
but no names, are printed. Below is a column of these numbers and 
the examinee is instructed to write opposite each the town or river 
which it represents. The same technique can be used for testing
knowledge of the names of parts of some instrument, mechanism or
biological specimen.


Each correct answer receives one mark; thus 6 marks may be
scored on Question 4.

The advantages of this type of item are, first, its naturalness and
its similarity to ordinary questioning; secondly, its freedom from the
chance guessing effects which occur in closed or recognition items;
and thirdly, the facility with which it may be constructed. This
facility, however, is somewhat delusive, and unsuspected flaws are
very likely to creep in. The chief disadvantage of the type is that its
scope is practically limited to rote knowledge, or skill at simple
arithmetical and scientific problems.

1 The author feels that an apology may be needed for the puerility of some of
the illustrative questions in this chapter. They are intended to show the applica-
tion of the technique to various school and university subjects, at varying levels
of difficulty. Preferring to construct fresh ones rather than to quote from pub-
lished tests, he is limited by the rustiness of his own education.

A number of hints on construction may be listed as follows. 

(a) Try to ask only for important items of knowledge. 

(b) Ensure as far as possible that there is only one right answer, 
and therefore confine the answers to single words or numbers, or to 
very short phrases. Some of the blanks in Question 4 violate this 
rule. The last one might be filled by ‘conduction,’ ‘carriage,’ or 
‘carrying.’ If there is any likelihood of alternatives, the examiner 
should list them all beforehand, and decide which ones to allow as 
correct. In this instance he would have to decide whether to disallow 
‘passage.’ Examiners may find it easier to conform to both these rules 
if they think of the answer first, and then build up the question or 
sentence around it, in such a way that the answer is completely 
delimited. 

(c) Keep the questions and sentences as short and as unambiguous 
as possible, and in particular avoid over-mutilation of a completion 
sentence by inserting too many blanks. A completion sentence should 
always be looked on as a form of question. See therefore that sufficient 
words are left to make it an intelligible question. For the same 
reason, do not put blanks near the beginning of a sentence, which 
can only be answered by consulting later clauses of the sentence, or 
later sentences in the paragraph. Most of the blanks should be at the 
end of the sentences. 

(d) Study the whole content of a sentence or question to see that 
all of it is essential for the production of a correct response. In 
badly constructed items, it may be possible to deduce the response 
simply by the exercise of intelligence, without any knowledge of the 
subject-matter. In others the correct response may be given away 
somewhere in a later sentence; or hints may be derived from the 
grammatical construction. For example, blank spaces should not be 
preceded by ‘a’ or ‘an.’ An irrelevant clue is provided by the ‘an’ in 
the following example: 


6. A piece of land entirely surrounded by water is called an ........

This difficulty could be overcome by wording the sentence as a
question, or by printing ‘a(n) ........’

(e) Do not vary the length of the blank space according to the
size of the answer expected, as this would also supply irrelevant
clues. See that the spaces are big enough for examinees with large
writing.

(f) For purposes of easy marking, it is desirable to arrange for 
answers to be written in the right-hand margin. It is much more 
difficult for the eye to pick out answers scattered all over the page. 
Even the following is rather inconvenient: 

Write down the author of each of these plays.

7. The Admirable Crichton ........

8. Candida ........

One solution is to number the blank spaces and to print a separate
column of numbers opposite which the answers are written, thus:

Write down the author of each of these plays.

7. The Admirable Crichton                    7 ........

8. Candida                                   8 ........

A paragraph-completion test, with scattered answers, may be scored
more efficiently by constructing a stencil of cardboard, with holes 
through which only the answers are visible when the stencil is laid 
over the test paper. Under each hole on the stencil is written the 
correct answer. 


TRUE-FALSE TYPE 


This generally consists of a set of statements, approximately half 
of which are true, the rest false. The examinee has to indicate which 
is which. In a spelling test the ‘statements’ may be reduced to single
words, whose correctness or incorrectness is to be checked. For 
example: 


Write a + after each of the following words which is correctly 
spelled, and a — after each one wrongly spelled. 

9. Medecine ........ 

10. Embarrassment ........

11. Accelerate ........ 

ION SEZ „эзан 


A modification which is useful for tests of spelling, grammar, and
literary knowledge is to instruct the examinee to cross out, or under-
line, the one word in each of a set of sentences which is ungram-
matical, or wrongly spelled, or wrongly quoted. One example of
each is given.¹

14. I met him and she in the town.

15. The ruffian demolished all the aparatus in the laboratory.

16. Rule, Britannia! Britannia rules the waves.

1 These should perhaps be classed as multiple-choice items, since the examinee
has to pick out the incorrect word from about half a dozen correct ones.

The ordinary true-false type has been commonly used in tests of 
all subjects, because it is capable of sampling very quickly a wide 
range of knowledge, and because examiners find it easy to construct. 
But the uncertainties due to guessing (cf. pp. 220-2) and, still more, 
its liability to unsuspected errors, have recently made it less popular. 
Some recommend that it should only be used when the multiple- 
choice type is ruled out for lack of plausible alternatives. Thus the 
following are justified since each involves only two possible answers: 

17. Distance east or west of Greenwich is
known as longitude.                               True   False
18. When summer time starts the clocks
are put back one hour.                            True   False

There is no need to confine true-false items to rote information. 
Indeed, questions of general principle, of interpretation and the 
like can well be expressed in this form. Great care is needed, how- 
ever, to see that they do not violate some of the following principles 
of construction. 

(a) Statements should be simply worded, otherwise they are 
liable to contain sections which are true and sections which are false. 
In general it is best to use sentences which contain only a single 
clause, embodying only a single idea; to word them positively ; and 
to refrain both from negatives and from conjunctions such as 
‘because’ and ‘therefore.’ The following examples are ambiguous, 
and therefore unsatisfactory. 


19. Napoleon lost the battle of Waterloo
because his army had insufficient ammunition.     True   False
20. Beethoven was a great composer who
wrote many symphonies, operas, and sonatas.       True   False


The most crucial part of a statement, which renders it true or false, 
should usually come near the end. 

(b) It is not easy to construct difficult statements, nor to estimate 
their degree of difficulty beforehand. They need to be sufficiently 
plausible to deceive the examinee whose knowledge is incomplete, 
and therefore not too obviously true or false. And the true ones 
should be just as likely to appear false as the false ones true. At the 
same time they must be clearly true or false to the able examinee, and 
therefore must not contain ‘tricks.’ Good students look out for
essentials, and so may be tripped up if errors are introduced into
minor details. The following example is bad for this reason.

21. Water boils at 212° and freezes at 32° C.     True   False

(c) It has been found that certain forms of wording are much 
more apt to be used in true than in false statements, or vice versa. 
Examinees soon get to recognize these, and so may answer correctly 
without any knowledge at all of the subject-matter. The great majority
of statements, in amateur new-type tests, which contain the words:
‘All,’ ‘Always,’ ‘Never,’ ‘Impossible,’ are false. The majority of those
which contain ‘Usually,’ ‘Often,’ and the like, are true. Most very 
long statements also tend to be true more often than false. There is 
no need to discard the above words altogether, but they should be 
used with caution and, if possible, introduced as frequently into 
true as into false statements. 

(d) If the number of statements is less than, say, twenty, there
should not be an exactly equal division between true and false, else
examinees will count up to make sure that they have checked them
half and half.

(e) The order in which the statements appear must, of course, be
entirely random. If the examiner never prints more than two true or
two false statements consecutively, examinees may again discover
this and take advantage of it. One way of arranging is to toss a coin.
If it shows heads, take the easiest of the true items as No. 1. If the
next four tosses are tails, take the four easiest false statements as
Nos. 2 to 5; and so on.


(f) Various methods for recording the responses have been used:
underlining the words True or False, or Yes or No, drawing a circle
round the letters T or F, or writing the letters T or F. One of the
simplest and most satisfactory is to write a + or a − in a blank
space at the end of each statement. These spaces should all be
aligned in a straight column in the right- or left-hand margin.

(g) The correction for guessing, described on p. 221, should be
applied, and examinees should be warned of this in the general
instructions at the beginning of the test.
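The coin-tossing procedure of hint (e) may be sketched in Python as follows. The function name is hypothetical, and one detail is an assumption not stated in the text: when one pile is exhausted, the remaining items of the other pile are simply taken in turn.

```python
import random


def coin_toss_order(true_items, false_items, rng=None):
    """Order true-false statements by repeated coin tosses.

    true_items and false_items are each assumed to be listed already from
    easiest to hardest. Heads takes the next easiest true statement, tails
    the next easiest false one. Returns a list of (statement, is_true) pairs.
    """
    rng = rng or random.Random()
    true_left = list(true_items)
    false_left = list(false_items)
    ordered = []
    while true_left or false_left:
        heads = rng.random() < 0.5
        # Take a true item on heads, or whenever no false items remain (assumption).
        if (heads and true_left) or not false_left:
            ordered.append((true_left.pop(0), True))
        else:
            ordered.append((false_left.pop(0), False))
    return ordered
```

Passing a seeded generator, e.g. `random.Random(1)`, makes the arrangement reproducible should the paper need to be set up again.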


MULTIPLE-CHOICE TYPE, INCLUDING BEST REASON AND 
MATCHING ITEMS 


A large number of variants of this type may be distinguished,
examples of most of which are quoted below.

(i) The number of alternative responses is usually five, but it may
range from seven to three, or even two, according to the number of
possibilities implicit in the question. Six- or seven-response questions 
occupy rather a lot of space. Three- or two-response questions should 
usually be avoided because they require the application of guessing 
corrections. In any one new-type test it is convenient, though not 
really necessary, to provide the same numbers of responses for all 
the questions. 

(ii) When the responses consist of single words, they can be 
printed all on one line, consecutively. They should be enclosed in 
brackets, and preferably be printed in italics. The examinee is then 
instructed to underline the one word in each set of brackets which is 
correct. For example: 

22. They (was, were) going (to, too) London (too, to) visit (their,
there) aunt, (who, what) lived (there, their).

When, as in most of the remaining examples, the responses consist
of several words or phrases, underlining is troublesome both for the 
examinee and for the marker. The responses may sometimes be 
printed consecutively and numbered, and the examinee writes the 
number of his choice in a blank space. For example: 

23. What is the function of the sacral autonomic nervous system?
(1) To increase circulation and respiration. (2) To control posture
and movements of the limbs. (3) To increase the activity of the
reproductive organs. (4) To contract the sphincters of the excretory
organs. (5) To inhibit the activity of the alimentary canal.
........

This type of question is, however, inconvenient to read, hence it
is more usual, as in the examples below, to print the responses in a
column. The examinee indicates his choice by a cross in one of the
blank spaces in the margin. Unfortunately this method tends to
consume a lot of space.

(iii) The items may be stated in question or in completion form.
Thus Question 23 might equally well commence:

23. The function of the sacral autonomic nervous system is:
(iv) In the ‘best reason’ type, the examinee is instructed to check 
the best of a set of reasons for a given statement. Occasionally a 
more suitable item may consist in finding the worst reason. For 
example: 


24. Many investigations indicate that children’s intelligence is
somewhat affected by environment and education. Which of the
following provides the poorest evidence for this conclusion?

The correlation between the intelligence of orphans
and their parents is lower than that between
children and their parents who have reared them        ........
The average difference between the I.Q.s of pairs of
identical twins is higher among those reared apart
than among those reared together                       ........
Children tested before placement in foster homes,
and again after several years, show a greater in-
crease in intelligence in the better than in the
poorer homes                                           ........
Children of professional parents are on the average
markedly superior in intelligence to children of
unskilled labourers                                    ...x...


(v) Two or three responses, instead of one, may be correct. 
Occasionally none of them may be correct. The instructions should 
make it absolutely clear to the examinees what is wanted. For 
example: 


25. Put a cross opposite two of the following English painters
who are especially famous for their landscapes:

Constable ...x...    Reynolds ........
Hogarth   ........    Romney   ........
Lawrence  ........    Turner   ...x...

26. Put a cross opposite one answer only.
Sir Thomas Browne is best known as the author of:
Fiction          ........
Plays            ........
Poetry           ........
Works of travel  ........
None of these    ...x...


(vi) The same set of responses may apply to several questions. For
example:

Identify the following orchestral instruments by drawing a circle
round S for strings, W for woodwind, B for brass, and P for per-
cussion.

27. Clarinet        S  W  B  P
28. Triangle        S  W  B  P
29. Double bass     S  W  B  P
30. English horn    S  W  B  P
31. French horn     S  W  B  P


(vii) Matching tests are an extension of the previous category. A
number of questions and a number of responses are listed in different
order, and have to be fitted together. For example:


32. On the left is a list of Treaties or Peaces, on the right a list
of historical developments which were associated with these Treaties.
After each of these developments write the letter of the Treaty to
which it corresponds. Note that some of these letters need not be
used at all, and that some may be used twice.

Peace or Treaty of:    Historical Development:

(a) Berlin             Napoleonic empire broken up                    ........
(b) Frankfort          League of Nations established                  ........
(c) Paris              German annexation of Alsace-Lorraine           ........
(d) Tilsit             France regains Alsace-Lorraine                 ........
(e) Utrecht            Independence of Holland finally recognized     ........
(f) Versailles         Britain becomes supreme colonial power         ........
(g) Vienna             Western powers set the stage for Balkan wars   ........
(h) Westphalia


Many matching tests have equal numbers of items in both columns, 
each item corresponding to one, and only one, in the other column. 
Since, however, this makes possible the discovery of responses by a
process of elimination, the above plan, with unequal numbers, is
superior. The numbers of items should usually be five to twelve in
each column; a greater number involves waste of time in searching. 
When possible the items in at least one column should be in alpha- 
betical order, to facilitate the search. It may be noted that this type 
is more compact than ordinary multiple choice, and so saves con- 
siderable space. 

The multiple-choice type of question is the most generally ap- 
plicable in educational testing—to diagrams, maps, etc., as well as to 
verbal or numerical problems. Not only factual information, but also 
quite elaborate reasoning and interpretation can be measured by it. 
In the best-reason type, for example, the alternatives provided may 
be graded in reasonableness; but only one answer should be com- 
pletely correct. Again in matching tests, the examinees can be re- 
quired to apply a series of principles to a series of problems. Very 
great care, however, is needed in the construction, and the following 
hints may be useful. 

(a) The multiple-choice type should not be used when simple-
recall questions would be adequate, i.e. when there is only one
definitely right answer to the question. For example, it would be a
waste of time to express arithmetical problems this way: 


33; ak Ет 
Indeed, most mathematical problems are better in open than in 
closed form, since the latter may enable the examinee to work back- 
wards from each answer until he reaches the right question, which is 
a very different matter from working forwards from the question. 

(b) All the responses provided must be sufficiently plausible to be 
selected by a fair proportion of the examinees. Otherwise there is no 
object in including them. The incorrect alternatives can often be 
made up out of misconceptions and errors which are likely to 
be prevalent among the examinees. Sometimes, also, they will be
selected by many of the poorer examinees if they contain several
words or phrases identical with words or phrases in the question or
introductory statement. Both correct and incorrect responses should, 
however, be homogeneous in their mode of expression, length, and 
other external characteristics. If the correct answer is always longer, 
or always shorter, than the incorrect ones, examinees may discover 


throw together such heterogeneous items that the examinees may be 
able to match them without any real knowledge of the subject. The 


34. (a) A geographical line along
        which atmospheric press-
        ures are equal                  Watershed      ........
    (b) Scottish lakes or sea inlets    Monsoons       ........
    (c) Winds which blow at a cer-
        tain season in the Indian
        Ocean                           Isobar         ........
    (d) A high piece of land from
        which rivers flow in two        Contour lines  ........
        or more directions
    (e) Lines on a map showing          Lochs          ........
        heights above sea-level

‘Contour lines’ must be (e), because ‘lines’ are mentioned in (e).
‘Monsoons’ and ‘lochs’ must be either (b) or (c), since these are the
only remaining plurals. It is unlikely that ‘monsoons’ would be
printed so nearly opposite its corresponding item, moreover ‘lochs’
sounds like ‘lakes,’ hence ‘lochs’ must be (b). (a) and (d) mention
‘geographical line’ and ‘rivers,’ hence ‘isobar’ and ‘watershed’ will
probably fit them.

The examiner should remember that a matching test is an ex- 
tended form of multiple-choice item, and should therefore see that 
for each item in one column there are at least three items, preferably 
four or more items, in the other column which constitute possible 
answers. 

(d) Not infrequently a better item may be obtained by turning 
round a partly formulated question into an answer, and making the 
correct answer into a question. For example: 

35. Emphasis by means of an understatement is called:

Meiosis    ........
Hyperbole  ........
Euphemism  ........
Metaphor   ........

This might be re-formulated as follows:

36. Meiosis denotes:
Emphasis by means of an understatement                   ........
An exaggerated rhetorical expression                     ........
Substitution of a pleasant for an offensive expression   ........
An imaginative simile                                    ........


Probably the second form (No. 36) requires a fuller understanding 
of the subject than does the first (No. 35). A third, perhaps still 
better form of the question, which would demand the application and 
not merely the possession of knowledge, would be a set of phrases 
from among which examinees would select the one showing meiosis. 
In general the examiner should study each item in an attempt to 
anticipate all the mental processes by means of which the correct 
answer may be reached. If any of these processes are not representa- 
tive of the ability that he wishes to measure, then he should modify 
the item. 

(e) All statements and answers should be as succinct as possible, 
consistent with accuracy. Very verbose or complicated questions 
may depend more on the candidate's reading ability and intelligence 
than on his knowledge of the subject. (Possibly Qu. 24 offends in 
this respect.) 
(f) It should hardly be necessary to point out that the position
of the correct answer in the series must be chosen entirely at random. 
First and last places should be employed as often as any of the 
intermediate places. 
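Hint (f) can be met mechanically by shuffling the responses of each item. The following Python sketch is one way of doing so; the function name is hypothetical, and the use of the standard library's `random.Random.shuffle` is merely a convenient substitute for drawing lots.

```python
import random


def shuffle_responses(correct, distractors, rng=None):
    """Place the correct response at a randomly chosen position.

    Returns the shuffled list of responses together with the index
    (counting from 0) at which the correct response now stands,
    so that the scoring key can be prepared at the same time.
    """
    rng = rng or random.Random()
    responses = [correct] + list(distractors)
    rng.shuffle(responses)
    return responses, responses.index(correct)
```

Over a long paper every position, first and last included, should then be occupied by the correct answer about equally often.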


REARRANGEMENT TYPE 

A series of items which normally fall into a certain specific order
is listed in disarranged order, and the examinee has to rearrange
them. For example:


37. Below is a list of waves or rays. Opposite No. 1 in the right-
hand margin, write the letter (a, or b, or c, etc.) of the wave or ray
with the longest wave-length. Opposite No. 2 write the letter of the
wave or ray with the second longest wave-length, and so on down
to No. 7.¹

(a) Rays from a sodium vapour lamp                1 ...f...
(b) Waves from a black kettle containing hot      2 ...d...
    water
(c) Gamma rays                                    3 ...b...
(d) Television transmission waves                 4 ...a...
(e) Rays from the green of the spectrum           5 ...e...
(f) Wireless telephone waves                      6 ...g...
(g) X-rays                                        7 ...c...

The mental processes involved in answering this question will
naturally depend on what has been taught. If the pupils have been 
provided with such a list, they may answer by rote recall ; if not, then 
they may have to perform an elaborate process of analysis and re- 
synthesis of their knowledge of various branches of physics. 

By disarranging the order of phrases in sentences, Ballard (1923) 
provides a test which, he claims, measures knowledge of sentence 
structure and of the canons of English style. The time sequence of
historical events, the sizes of geographical areas, the order of opera- 
tions in some chemical or physical experiment, grammatical se- 
quences in foreign languages, and the like, can readily be expressed 
in this type. It has not, however, been very widely used, probably 
because of the difficulty of devising a simple and just method of 
scoring (cf. Sims, 1937). Ballard's (1923) plan seems fairly satisfactory.
Award one mark if the first item is correctly identified; thereafter
award one mark for each item which correctly follows the one
preceding it. For instance, if an examinee answers Question 37:

(f) (d) (a) (e) (g) (b) (c) instead of (f) (d) (b) (a) (e) (g) (c),

(f) is correctly placed first and scores one; (d) correctly follows (f)
and scores one; (a) should not follow (d), but (e) and (g) are in
correct sequence, and so score two; (c) does not score, although it is
correctly placed last, because it should not follow (b). The total
score here is thus 4 marks.

1 The reader may wonder why the instructions are not simplified by requiring
examinees to write 1 after the item that should come first, 2 after the next, and
so on. The reason is that the scoring of such a response would be exceedingly
time-consuming, unless the whole of it was correct.
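Ballard's scoring plan for rearrangement items, as described above, is easily mechanized. The following Python sketch (the function name is hypothetical) awards one mark for a correct first item and one for every item that correctly follows its predecessor; applied to the worked answer to Question 37 it yields the same 4 marks.

```python
def ballard_score(response, key):
    """Score a rearrangement item on Ballard's plan.

    response: the examinee's ordering, e.g. ["f", "d", "a", "e", "g", "b", "c"]
    key:      the correct ordering,   e.g. ["f", "d", "b", "a", "e", "g", "c"]
    """
    # For each item in the key, record which item should come immediately after it.
    successor = {first: second for first, second in zip(key, key[1:])}

    # One mark if the first item is correctly identified.
    score = 1 if response and response[0] == key[0] else 0

    # One mark for each item which correctly follows the one preceding it.
    for previous, current in zip(response, response[1:]):
        if successor.get(previous) == current:
            score += 1
    return score


# The worked example from the text scores 4; a wholly correct answer scores 7.
print(ballard_score(list("fdaegbc"), list("fdbaegc")))  # 4
print(ballard_score(list("fdbaegc"), list("fdbaegc")))  # 7
```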

Certain other minor types of question have been described by 
Ruch (1929), Rinsland (1938), and others. But these all appear to be 
variants of the simple-recall or completion, true-false, or multiple- 
choice types. 


PRACTICAL EXAMINATIONS 


Probably one of the most unreliable of conventional examinations 
is the practical in physics, chemistry, and biology and the engineering 
trade test; because a 3-, or even a 6-, hour period can provide only 
such a limited sampling of the skill or knowledge of the examinees. 
Though such examinations cannot be converted to wholly objective, 
multiple-choice form, it was found by American institutions for 
technical training of recruits during the war that great improvements 
were possible along the same lines. The operations at which examin- 
ees are supposed to be competent are analysed into a series of steps, 
and some 30 to 50 typical problems are selected, each of which can 
be done in a few minutes. For example, in electricity, a certain circuit 
is nearly completed, and the examinee has to make one essential 
connection which will show his grasp of the circuit without his hav- 
ing to do the whole wiring. In chemical analysis, he may be told that 
such and such reagents have given such and such results ; what should 
he do next? Several questions in an engineering examination may be 
based on capacity to choose the right tools for a given job. In others, 
parts of machines may be laid out and the examinee required to 
indicate how they should be repaired, and so on. Each of the fifty-
odd questions needs to be supervised by a separate examiner;
senior students can often be employed for this purpose. The ex- 
aminees move round from one problem to another in turn, as many 
being examined simultaneously as there are problems. Each problem 
is usually marked as right or wrong, the right response being care- 
fully defined beforehand; though occasionally half-marks may be 
given if partial competence can be adequately defined.


FINAL STAGES IN THE CONSTRUCTION OF A NEW-TYPE
TEST


The examiner should rescrutinize every item, try to eliminate all 
ambiguities, and ensure that the whole of each item is going to
function, i.e. that every word or phrase will be essential in arriving
at a correct response, and that there is no possibility of obtaining
correct responses through irrelevant clues (cf. Hawkes and Lindquist
et al., 1937). This can be done much more effectively if the questions
are submitted to other experienced examiners who may detect 
flaws which the author has not noticed. 

Next it is essential to make up comprehensive instructions for each 
type of question, which will be appropriate to the mental level of the 
dullest examinees. Unless the examinees are fairly sophisticated to 
new-type testing, sample questions already answered should be in- 
cluded. In classroom work it is a good plan to provide three samples, 
one already answered, to read this through with the class and point 
out why the given answer is correct, and then to ask the class to 
suggest answers to the other samples, helping any individual mem- 
bers who still do not understand what is wanted. 

Attention should then be directed to ease of scoring. If all the 
answers can be aligned in a column in the right-hand margin, a key 
can readily be prepared consisting of a sheet for each page of the 
test, down the left-hand edge of which are written the correct answers 
in the correct positions. 

It might be supposed that as some questions are harder than
others, they should be allotted more marks. But actually many 
experiments show that weighting is hardly worth the trouble. The 
unweighted total scores correlate almost perfectly with weighted 
totals. A very simple form of weighting such as the following may be 
used: award 1 mark to each correct simple-recall or completion 
response, also to each correct response in a matching item; give 1 
mark to each true-false or two-response multiple-choice item and 
apply the R-W correction; give 1 mark to each three-response 
multiple-choice answer, but neglect the correction for guessing; give 
2 marks to each multiple-choice item where four or more responses 
are provided.
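The simple weighting scheme just given may be set out as a short Python sketch. The item labels and function names are hypothetical. It is also assumed here that the R-W correction means rights minus wrongs, so that a wrong answer to a two-response item loses one mark and an omitted item neither gains nor loses; the correction itself is described elsewhere in the book and not in this section.

```python
def item_mark(kind, outcome, n_responses=None):
    """Mark a single item under the simple weighting scheme.

    kind:        "recall", "completion", "matching", "true_false"
                 or "multiple_choice" (hypothetical labels)
    outcome:     "right", "wrong" or "omitted"
    n_responses: number of alternatives offered (multiple-choice only)
    """
    if outcome == "omitted":
        return 0
    right = outcome == "right"
    if kind in ("recall", "completion", "matching"):
        return 1 if right else 0
    if kind == "true_false" or (kind == "multiple_choice" and n_responses == 2):
        # R-W correction, taken here to mean rights minus wrongs (assumption).
        return 1 if right else -1
    if kind == "multiple_choice" and n_responses == 3:
        # One mark, the correction for guessing being neglected.
        return 1 if right else 0
    if kind == "multiple_choice" and n_responses is not None and n_responses >= 4:
        return 2 if right else 0
    raise ValueError("unrecognized item: %r with %r responses" % (kind, n_responses))


def weighted_total(answers):
    """Sum the marks for a list of (kind, outcome) or (kind, outcome, n_responses) tuples."""
    return sum(item_mark(*answer) for answer in answers)
```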

Although scoring is so rapid and fool-proof, slips do occur,
particularly in adding up totals. Either then the marker should go
through the papers twice, or else let the students or pupils re-mark
their own papers and ask them to bring any errors to his notice.

Finally the examiner will certainly improve his future examining
if he takes the trouble to study the difficulty and the diagnostic value 
of each item. A simple method is to sort the papers into four piles— 
the quarter with the highest totals, the quarters above and below the 
median, and the quarter with the lowest totals. The numbers of times 
each item is correctly answered by members of each group, and by 
the four groups combined, may be tabulated. The examiner may 
then compare the actual difficulties of the items with his estimates of 
difficulty made when setting the paper, and may note whether the 
proportions of passes for each item ascend regularly from the lowest 
to the highest group. If they fail to do so he should attempt to 
discern the flaw in the item which may be responsible for its poor 
validity. 
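
The four-pile tabulation can be sketched as follows. The sketch assumes each paper is recorded as a list of 1s and 0s (item passed or failed); the function names and the sample data are hypothetical, and the check for regular ascent is only one rough way of flagging an item for closer inspection.

```python
def item_analysis(papers):
    """Tabulate passes per item for the four quarters, lowest totals first.

    `papers` maps a candidate to a list of 0/1 item results.
    """
    ranked = sorted(papers.values(), key=sum)  # lowest totals first
    n = len(ranked)
    bounds = [i * n // 4 for i in range(5)]  # cut points for the four piles
    groups = [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]

    table = []
    for item in range(len(ranked[0])):
        counts = [sum(paper[item] for paper in group) for group in groups]
        proportions = [c / len(g) if g else 0.0 for c, g in zip(counts, groups)]
        table.append(
            {"counts": counts, "proportions": proportions, "total": sum(counts)}
        )
    return table


def ascends_regularly(proportions):
    """True if the proportion passing never falls from one group to the next."""
    return all(a <= b for a, b in zip(proportions, proportions[1:]))


papers = {
    "c1": [0, 0, 1], "c2": [1, 0, 0], "c3": [1, 0, 1], "c4": [1, 1, 0],
    "c5": [1, 1, 0], "c6": [1, 1, 0], "c7": [1, 1, 1], "c8": [1, 1, 1],
}
table = item_analysis(papers)
for number, row in enumerate(table, start=1):
    print(number, row["counts"], row["total"], ascends_regularly(row["proportions"]))
```

Here the third item is passed as often by the lower groups as by the upper ones, which is the kind of irregularity that should send the examiner back to look for a flaw in its wording.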




CHAPTER XIV 


IMPROVEMENTS IN EXAMINING 


EXAMINATIONS have far too great a burden to bear. They are sup- 
posed to indicate, as shown in Chap. XI, a large number of only 
partially related traits and abilities. If we could define more clearly 
what it is that we want to measure, and use for each purpose the 
most appropriate instrument, there is no doubt that we could effect 
a great improvement in the efficiency of these instruments. Let us 
consider first some of the substitutes for examinations which have 
been recommended by educationists, and the functions which they 
may serve. 

Oral Examinations, Viva Voces, and Interviews.—Oral examina- 
tions are, of course, much older methods of educational measure- 
ment than written examinations. Their main defects were pointed 
out when written examinations began to take their place, over one 
hundred years ago. It was recognized that the situation to which 
successive candidates are subjected cannot be the same for all, so 
that no sure basis exists for a comparison between different candidates. 
The examiner's tone of voice, the wording of his questions, 
his encouragement or criticism, etc., may vary widely from one 
candidate to another. If, as in school inspections, all the candidates 
are interviewed in the presence of one another, the examiner has to 
pose a series of different questions which are unlikely to be closely 
comparable in difficulty. In any viva voce the amount of time devoted 
to each candidate is usually so short that reliable indications 
of his abilities can hardly be expected. There is the further defect that 
the situation may cause much [...] does a written examination. The [...] 
may, indeed, reveal something of his interests, his attitude to work, 
and cultural background, which are educationally relevant. But as 
often applied in 11+ selection, the secondary school Head's interview 
probably does little more than assess social class and nice 
manners. Oral examinations in foreign languages will bring out the 
candidate's goodness of pronunciation, but cannot be trusted to 
show his facility in conversation, because of the distorting effects of 
emotion. For the same reason, viva voces in other subjects cannot 
throw much light on the candidate's capacity for formulating his 
knowledge orally. Theoretically they should constitute a valuable 
supplement to written examinations, since they provide a fresh 
medium of expression for ability. But investigations have shown that 
judgments based on interviews tend to be still less reliable than those 
based on essay-examination answers. 

Hartog and Rhodes's (1935) study of the Civil Service interview 
is particularly apposite. Sixteen candidates were interviewed in the 
ordinary way by two different boards, each composed of four or five 
experienced examiners. The members of each board consulted 
beforehand as to the questions they would ask (but the two boards 
did not inter-communicate), and had before them the candidates' 
educational records. During the quarter- to half-hour interviews the 
members attempted to assess a candidate's alertness, general in- 
telligence, and the suitability of his personality for the Civil Service. 
After discussion they drew up a final mark. It was found that mem- 
bers of any one board did agree fairly closely in their judgments, but 
that the two boards had reached widely differing conclusions. The 
two final marks differed by 12% for the average candidate, in one 
case by as much as 31%, in another by as little as 1%. The correlation 
between the two sets of marks was only +0.41 ± 0.14. Except 
for the small number of the candidates, there seems to be no 
technical flaw in this experiment which would allow us to challenge 
its depressing results. 
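
The margin of 0.14 attached to the coefficient of 0.41 is consistent with the classical probable error of a correlation, 0.6745(1 - r²)/√n, for the sixteen candidates. That this is the formula intended is an assumption; the sketch below merely checks the arithmetic, and the function name is hypothetical.

```python
import math


def probable_error(r, n):
    """Classical probable error of a product-moment correlation (assumed formula)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


# Coefficient of 0.41 between the two boards, sixteen candidates.
print(round(probable_error(0.41, 16), 2))
```

The same formula shows why the small number of candidates matters: with n in the denominator under a square root, quadrupling the sample would only halve the margin.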

There have been many other studies of the interview in the 
educational, vocational, and military fields, showing similar in- 
consistencies among interviewers (cf. Vernon and Parry, 1949). In 
view of the frequent use of the interview by university authorities for 
selecting students, Himmelweit and Summerfield's (1951) study is 
very relevant. They obtained almost zero correlations between pre- 
dictions made by university staff from interviews and subsequent 
degree performance, whereas an academic entrance examination gave 
small positive correlations and a battery of psychological tests 
distinctly better ones (cf. also Vernon, 1953). It would seem that 
oral questioning and the interview, though essential in education for 
purposes of teaching, are useless for purposes of assessing ability or 
the results of teaching. In the course of a school year a teacher can 
undoubtedly build up a fairly comprehensive and reliable picture of 
each of his pupils' abilities, which may be largely derived from oral 
work. But the conditions here are very different from those in a 
brief viva voce. The volume of questioning is far larger, the emotional 
atmosphere less strained, and the oral answers are supplemented by 
written work. 

Perhaps the best instance of the value of the interview is provided 
by the work of vocational psychologists. In advising a child on his 
career the psychologist applies several objective tests of aptitudes, 
and takes both medical and educational records into account. But 
his information as to the child's personality and interests is derived 
chiefly from an interview. This is conducted informally so as to put 
the child at ease, and though it follows no stereotyped plan of questioning, 
it has, in fact, been designed very carefully so as to elicit 
the child's main traits. A further function of this interview is that it 
enables the psychologist to interpret the test results and other 
objective records, and to realize their significance in a synthetic view 
of the child as a whole. Guidance which is based merely on the 
mechanical application of tests or examinations is not successful. 
But there is ample proof that the guidance given by trained psycho- 
logists who adopt these test and interview methods does possess a 
high degree of predictive validity. Again in vocational selection for 
skilled trades, the psychologist can often use the interview success- 
fully for assessing the amount and the relevance of previous trade 
experience, though—when conditions permit—he also constructs 
and validates a battery of selection tests (cf. ftn. p. 145). For higher- 
grade occupations such as managers, officers in the Services, and 
teachers, where tests seldom seem to be of any value, the technique 
of observing small groups of candidates (as developed by War 
Office and Civil Service Selection Boards), combined with a psycho- 
logist’s interview, is the most successful (Vernon, 1953). 

It would seem, then, that the interview, as at present employed in 
educational circles, is practically valueless, but that in skilled hands 
it may become a most useful tool. There is no reason other than 
expense and the scarcity of competent psychologists why the methods 
of vocational selection and guidance should not be applied in education, 
both for choosing pupils most likely to benefit from higher 
education, and for advising them as to the lines of study for which 
they are best fitted. This is already common practice in many 
American and some Australian Universities. It is occasionally used 
here at the 11+ selection stage, though necessarily confined to a 
small proportion of borderline candidates. 

Intelligence and Special Aptitude Tests.—Theoretically, intelligence 
tests should be able to fulfil one of the functions commonly assigned 
to examinations much better than examinations do at present, that 
is the selection of pupils or students with the greatest intellectual 
promise, as distinct from achievement. As we have seen, this is not 
because they really measure innate ability, but rather because they 
show the general level of thinking that the child has developed out- 
side school, which he should be able to apply in school. They should 
also be able to measure objectively the qualities which essay-type 
examinations are supposed to elicit, and which new-type tests are 
supposed to be incapable of eliciting, such as ‘originality, power of 
thinking, grasp of general principles, etc.’ Experimental investiga- 
tions indicate that they do have some value for such purposes, but 
that they need to be employed with caution. They seem to be most 
useful among younger children. The Terman-Merrill, applied at 6 or 
7 years, correlates very highly with educational achievement for the 
next few years; coefficients of +0.80 or more may be expected. At 
the 11+ selection stage, the group intelligence test usually gives 
slightly better correlations with subsequent achievement than any 
other single instrument. Within a selected grammar school population, 
correlations would be expected to drop to about +0.40, hence 
there will naturally be many discrepancies between I.Q. and brightness 
as observed by the school teachers. But when these teachers 
criticize tests they fail to realize how very much worse most of the 
pupils with I.Q.s of 110 down to 70 would be. In the modern school, 
correlations remain relatively high. 

At later ages, the agreement between general intelligence tests 
and academic work is usually lower still, because of the still greater 
restriction of range. In American universities, with their more 
heterogeneous students, coefficients still average around 0.4 to 0.5 
(cf. Eysenck, 1947b). But here the figure is usually nearer 0.2 to 0.3 
with degree results. Similar correlations are found among Training 
College students with their combined theory marks; with practical 
teaching marks there is little or no agreement. 
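
The effect of restriction of range on a correlation can be illustrated numerically. The sketch below uses the standard formula for direct selection on the predictor, in which the correlation in the selected group depends on the ratio of the selected group's standard deviation to that of the whole population. The assumptions are that selection is made on the test itself and that the regression is linear with constant scatter; the ratios tried are hypothetical and are not figures from the studies cited.

```python
import math


def restricted_correlation(r_full, sd_ratio):
    """Correlation expected after selection on the predictor.

    r_full   -- correlation in the unselected population
    sd_ratio -- S.D. of the predictor in the selected group divided by
                its S.D. in the unselected population (between 0 and 1)
    """
    shrunk = r_full * sd_ratio
    return shrunk / math.sqrt(1 - r_full ** 2 + shrunk ** 2)


# Hypothetical S.D. ratios, starting from a correlation of 0.80.
for ratio in (1.0, 0.6, 0.4):
    print(ratio, round(restricted_correlation(0.80, ratio), 2))
```

The narrower the selected group, the lower the coefficient, although the test is measuring exactly what it measured before.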

More useful predictions at late adolescent and adult levels are 
likely to be reached through testing special aptitudes or group factors. 
Prognostic tests are intermediate between intelligence and 
attainment tests. They try to show, not how much mathematics or 
English the pupils have previously acquired, but how well they can 
apply their intelligence plus acquirements to fresh mathematical or 
linguistic problems. Earle (1949) has had some success with English, 
science, and mathematics tests, and with his Duplex tests, in diagnosing 
success at different types of secondary courses. Many American 
colleges give ‘placement’ examinations to incoming freshmen, 
consisting of linguistic and numerical + non-verbal intelligence tests, 
together with English, mathematics, science, foreign language, and 
social studies tests, in order to determine the students’ main aptitudes. 
There is room for considerable development of such work in our own 
schools and universities (cf. Dale, 1954). 

Cumulative Record Cards.—The primary function of many exam- 
inations is to provide a certificate of work done. Ostensibly this is 
true of the General and Scottish Leaving Certificates, and of univer- 
sity degree examinations, though they are also often interpreted 
prognostically. Now, in most schools the teachers know quite well 
what work each pupil has done. Hence there is every reason for 
making more use of teachers’ records of this work and less use of 
external examinations. The former cover the whole ground, the latter 
never more than a fraction of it. The former are collected under 
everyday conditions, the latter when many pupils are in an abnormal 
emotional state due to cramming and fear of failure. The former are 
also less likely to produce the degrading effects upon teachers’ and 
pupils’ educational values which are characteristic of public competitive 
examinations. Teachers’ markings of school work are, of 
course, just as liable as examiners’ marks to subjectivity and unreliability; 
perhaps more so, because of the prejudices they are likely 
to possess regarding each pupil’s character and ability, which may 
influence their judgment of his written work. But if in the course of 
his career a pupil is taught by several different persons, each of whom 
records his achievements according to some uniform system, the 
cumulative record should yield a fairly reliable picture. Although, 
as stated above, the function of such records should be to certify 
work done rather than to predict future promise, yet the investigation 
by the Scottish Council for Research in Education (1936) into 
the Leaving Certificate showed that teachers’ marks did in fact correlate 
with subsequent university success just about as well as did the 
external examiners’ marks. Likewise in McClelland's (1942) and 
several subsequent studies of secondary school selection, scaled 
primary school teachers’ estimates have given better predictions of 
grammar school work than have objective attainments tests. The 
many other uses to which cumulative records can be put, such as 
vocational and educational guidance, have been fully described by 
Hamley et al. (1937) and Walker (1955). 

The objection which will at once be raised to such schemes is—how 
are the standards of different schools to be compared? McClelland 
(1942) demonstrated the enormous discrepancies between teachers' 
assessments of their pupils and the levels actually obtained by the 
same pupils on objective tests, some teachers being far more generous 
than others. Indeed unless some system of scaling or standardization 
of assessments is adopted, the inclusion of such estimates in a selec- 
tion procedure is known to lower its validity. Several possible 
solutions to this difficulty have been tried. 

Valentine (1938) proposed that at 11+ one or more general intelligence 
tests should be given in all schools under one authority. 
If, say, one thousand grammar school places were available, and an 
I.Q. of 115 was found to cut off the top thousand pupils, then the 
number of pupils in each school scoring 115 and over constituted 
that school's ‘quota.’ Meanwhile the head and staff of the school 
would have ranked their own pupils in order of suitability for 
grammar school education. This rank order and the I.Q.s would 
finally be combined with appropriate weights. Each school would 
send its allotted quota to the grammar school, but the precise pupils 
included in the quota would be those whom the primary teachers 
had chosen as most able, not merely those with the highest I.Q.s. 
While this quota scheme has, in fact, been used in certain urban areas, 
it is quite unsuitable for areas where there are numerous small 
schools, each submitting a dozen or fewer candidates. Even with big 
schools it is rather unreliable since, as shown above (p. 209), a percen- 
tage or proportion has a very large Standard Error. 
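
The mechanics of such a quota scheme can be sketched as below. This is an illustration only: the school names, pupils, and I.Q.s are invented, the function names are hypothetical, and the weights for combining the school's rank order with the I.Q.s are not given in the account above, so the combined order is simply taken as an input.

```python
def school_quotas(iq_by_school, places):
    """Return the cut-off I.Q. and each school's quota.

    `iq_by_school` maps a school to a list of its pupils' I.Q.s.
    Assumes `places` does not exceed the number of pupils. Ties at the
    cut-off can make the quotas add up to slightly more than `places`.
    """
    all_iqs = sorted(
        (iq for iqs in iq_by_school.values() for iq in iqs), reverse=True
    )
    cutoff = all_iqs[places - 1]  # I.Q. of the last pupil within the top `places`
    quotas = {
        school: sum(iq >= cutoff for iq in iqs)
        for school, iqs in iq_by_school.items()
    }
    return cutoff, quotas


def fill_quota(combined_order, quota):
    """Take the first `quota` pupils from a school's combined order of merit."""
    return combined_order[:quota]


# Invented data: two schools competing for four places.
iqs = {"North": [128, 119, 112, 104], "South": [121, 116, 109, 98, 95]}
cutoff, quotas = school_quotas(iqs, places=4)
print(cutoff, quotas)
print(fill_quota(["Davies", "Evans", "Clark", "Baker"], quotas["North"]))
```

The test thus fixes only how many pupils each school may send; which pupils fill the quota is settled by the school's own order of merit.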

An alternative plan, which is receiving experimental trials, is to 
fix each school's quota initially at the same number as gained gram- 
mar school places in the previous year (or recent years). A few pupils 
who are ranked by their school as just above or below this borderline 
are examined, and the quota is adjusted up or down according to 
their results. This method is referred to by the Ministry of Education 
as selection without any external examination—not even an 
intelligence test. We have already pointed out that fairly considerable 
variations in standards may occur, by chance, from year to year 
(p. 63). Thus it is possible that the method, though admirable in 
intention, may break down through the necessary quota adjustments 
becoming too numerous and unwieldy. 

Many American Universities employ what is, essentially, an elabo- 
ration of the same scheme. They record the final secondary school 
marks of students coming from various contributory schools for 
several years, together with the performance of these students at the 
university itself. Eventually it becomes possible to determine the 
relative standards of the secondary schools concerned. For example 
it may be found that students with 60% from School A are just as 
good as those with 80% from School B. Thus the numbers of students 
worthy of selection from each school can be deduced. 

Probably the best plan is that advocated by McIntosh et al. (1949), the 
National Foundation for Educational Research and others, of scaling 
each school's own marks or rankings against some uniform external 
set of attainments tests, by an adaptation of the techniques described 
above (pp. 54-8). Selection can then be based primarily on the teach- 
ers’ judgments, derived from internal school examinations and from 
cumulative records of junior school work. But each set of judgments 
is calibrated or scaled up or down to accord with the distribution 
obtained by that school's candidates on the external, objective tests. 
An agreed weighting can be given to the objective tests also. Such a 
scheme too requires fairly large numbers per school, though Mc- 
Intosh finds it workable even among schools submitting only half a 
dozen candidates. It has the disadvantage that schools are still likely 
to cram for the external tests, but it does take full account of teachers’ 
knowledge of their own pupils. A useful discussion of these and other 
problems raised by the various selection procedures at 11+ is given 
by Dempster (1954). 
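
One simple form of such scaling can be sketched as follows. The sketch assumes a linear adjustment that gives a school's internal marks the same mean and standard deviation as its candidates obtained on the external tests, while leaving the teachers' order of merit untouched. The techniques referred to above may differ in detail, and the function name and the marks are invented for illustration.

```python
from statistics import mean, pstdev


def scale_to_external(teacher_marks, external_scores):
    """Rescale one school's teacher marks to the mean and spread of the
    same pupils' external test scores, preserving the teacher's order."""
    t_mean, t_sd = mean(teacher_marks), pstdev(teacher_marks)
    e_mean, e_sd = mean(external_scores), pstdev(external_scores)
    if t_sd == 0:
        # The teacher has not discriminated between pupils at all.
        return [float(e_mean) for _ in teacher_marks]
    return [e_mean + (m - t_mean) * e_sd / t_sd for m in teacher_marks]


# A generous marker whose three pupils do only moderately on the tests.
teacher = [80, 70, 60]
external = [55, 45, 50]
print([round(x, 1) for x in scale_to_external(teacher, external)])
```

The second pupil stays above the third after scaling although he did worse on the external tests: the tests settle the level and spread of the school's marks, and the teacher's judgment settles the order within it.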

Methods such as these require to be worked out and supervised by 
a statistically trained psychologist. As described here, they might not 
appear to be appropriate as substitutes for the General Certificate 
or other higher-level examinations. And yet the accrediting systems 
adopted by many Australian and New Zealand Universities show 
that something similar can be devised, which will largely banish the 
strain and unreliability of externally-conducted examinations among 
secondary school leavers, and yet provide a reasonable guarantee of 
scholastic achievement and of suitability for university courses. 

IMPROVEMENTS IN ORDINARY EXAMINATIONS 

Since external examinations will for a long time play a large, though 
we hope diminishing, part in education, and since even in an ideal 
system teachers must apply internal examinations, and mark their 
pupils’ work, we must consider carefully the improvements which 
might be made in the examination as it now is. From the investigations 
we have described, and from the discussion in previous chapters, 
the following principles may be deduced. 

1. The teacher or examiner should formulate as precisely as pos- 
sible what it is that he wants to measure. This implies first a clear 
conception of the objectives or aims to which, in his view, education 
should be directed. At the primary school level he may be content 
with the view that education consists in implanting certain tradi- 
tional knowledge in children's minds, such as the way to work out 
sums and to parse sentences. This may be one objective in the secon- 
dary school and university also, though surely not one of the most 
important. History, science, foreign languages, and the like do involve 
the acquisition of a certain body of facts—dates, chemical formulae, 
meanings of words, etc.; but they are supposed to possess many 
additional values. Taste for literature, interest in foreign peoples, 
scientific habits of mind, ability to apply knowledge to the interpre- 
tation of present-day events, are but a few of the things which we 
hope to inculcate. In making such an analysis, the teacher must be 
careful to take into account the facts which psychologists have 
established in regard to transfer of training, namely that general 
faculties, like observation, memory, and reasoning, will not be deve- 
loped merely by the study of some particular school subject, and 
that indeed these faculties have little real existence. For example, a 
rational and scientific approach to contemporary problems can be 
developed by studying typical problems rationally, by conducting 
discussions and projects, by showing how irrationalities of thinking 
arise, and so on; but not by the carrying out of stock experiments in 
physics, nor by memorizing the rules of grammar or logic. A Be- 
havioristic formulation should be helpful: instead of talking in terms 
of vague mental qualities, we should ask—what can the educated 
pupil or student do in each specific field of study, or in the world at 
large, which the uneducated one cannot do; and what are the precise 
elements of our teaching which are supposed to confer upon him 
these capacities? 




Having formulated as candidly as possible our general philosophy 
of education, we can ask which of these objectives are to be examined. 
We may decide that desirable moral conduct and social adjustment 
cannot be expressed in written work. But each objective should be 
considered from the point of view—what can the educated pupil 
write, or what problems can he solve, which the uneducated one 
cannot. 

The external examiner, who has not been responsible for teaching 
the examinees, should adopt a similar procedure in deciding what his 
examination is meant to measure. He needs, however, to be especially 
careful to emphasize ‘can’ rather than ‘ought’ in his analysis. It is 
merely futile for him to complain that teachers are not doing their 
work properly, and to insist that examinees must conform to intellec- 
tual standards which may be current among persons like himself, but 
which are quite unattainable by children or immature students. His 
job is to set tasks which will differentiate the better pupils who exist 
in real, not ideal, schools, from poorer ones. If he is supposed to pick 
out, say, the best 30%, but formulates objectives which even the best 
5% can hardly achieve, then he is not doing his work efficiently and 
is creating unnecessary difficulties for himself when the time for 
marking arrives. 

A group of persons, such as a head master and his staff, or a board 
of examiners, should be able to carry out such analyses much more 
logically and thoroughly than can any one person, however clear- 
headed he may be, since they can criticize one another’s ideas. 

Finally, the conclusions of this analysis should be communicated 
[...] examiners who try to catch out their victims, [...] find out what 
their examiners really want, [...] 

[...] new-type questions. If the numbers are very small, he may be justified 
in shirking the extra expenditure of time required by objective exam- 
ining. And he should know from past experience that he is more 
successful in devising new-type measures of some things than of 
others. 

Generally speaking, any kind of factual information, or straight- 
forward skills, such as the carrying out of arithmetical operations or 
grammatical analyses, should be examined in new-type form, instead 
of being left to the vagaries of subjective marking. This need not 
mean that only trivial and unrelated items of knowledge or skill 
should be included. Good questions, as we have seen, can bring out 
the examinees’ ability to interpret the significance of facts, grasp of 
general principles, and application of such principles to specific new 
problems. True, there are many dangers in these more elaborate 
items, but they are hardly as serious as the dangers of trying to 
discover all the underlying abilities of the examinees from their 
answers to essay questions. 

We admit, however, that many important aspects of ability may 
be displayed more clearly in essay-type answers or (with subjects 
such as mathematics and physics) in the working out of long and 
complex problems. Literary style of composition, imaginative 
qualities and originality, evidence of the examinee's interest in the 
subject and of his own thinking about it, are some of the things 
which the examiner may wish to elicit, even if he can never hope to 
measure them perfectly reliably. If the examination is designed 
primarily to bring out qualities such as these, then the examinees 
should be told so, and should be informed that amount and accuracy 
of information will not, in this instance, be taken into consideration. 
Frank directions about what is wanted are particularly desirable 
when the same examination includes some new-type and some 
essay-type questions. 


3. If, for one reason or another, essay-type questions have to be 
maids-of-all-work, then the examiner should ask himself which parts 
of the total field to be examined are the most important. Knowing 
that his questions can sample only a fraction of it, he should avoid 
the less essential parts, even at the risk of his questions being easily 
‘spotted’ beforehand. Overlapping among the questions is obviously 
undesirable also, since it is wasteful to cover any one part twice over. 
The unrepresentativeness of single papers, and the desirability of 
improving reliability by increasing their number should be borne in 
mind. With each question, the length of time needed for a reasonable 
answer should be considered, so that the examinees shall not be set 
too much, nor too little, to do in the period allowed. 

Much investigation has been made into the kinds of essay ques- 
tions which are least unreliable, particularly by the New York 
College Entrance Board (cf. Lindquist, 1951). It has been found that 
rather short discussion questions, each of which poses a definite 
problem and supplies specific indications of what is wanted, are the 
best. Such questions still give plenty of scope for the examinee to 
reveal his powers of marshalling and organizing his material. The 
much more vague, yet very popular, type of question which allows 
greater freedom of response to the examinee only leads to confusion 
and difficulty in comparing one examinee with another. 

4. Perhaps the most important quality in a good examiner is his 
ability to foresee the sort of answers he will get, both through his 
insight into the minds of pupils or students, and through his experi- 
ence of their answers in the past. If in formulating each question he 
possesses a clear conception of the probable answers—good, 
average, and bad—the following advantages are likely to accrue. 
The questions will be simply worded, free from ambiguities, and 
they will not be interpreted in such divers ways that the answers are 
too heterogeneous to be compared, or marked consistently. The 
questions will also be appropriate in difficulty, not, that is to say, the 
type of questions which no one but the examiner himself could 
answer adequately. Most important of all, the production of answers 
to these questions will really bring into play the abilities which the 
examination is aiming to measure. When, later on, the examiner 
reads and marks the answers, he should continually be trying to 
improve this ‘faculty,’ by noting how far his expectations are fulfilled. 
Every question may be regarded as an experiment in forecasting 
which, although never likely to be completely successful, may be- 
come progressively more so with practice. 

In this, as in all other branches of examining, two opinions are 
likely to be better than one, three than two. A board [...] 
will perhaps be the most efficient. 

5. When essay-type questions call for answers of a largely factual 
nature, a detailed scheme of marking [...] 
especially necessary when several different examiners each mark a 
certain proportion of the papers. Experiments indicate that it will 
reduce their variability by roughly one-half. But a single examiner 
will also greatly improve his consistency if he decides beforehand 
just what points to look for, and marks these objectively. 

In assessing the more general qualities which essay answers are 
supposed to bring out, the quality scale method (cf. p. 34) should be 
adopted. The examiner should always be on the alert to ensure that 
he is judging the scripts solely on the basis of the criteria which he 
has previously formulated, that he is not being biased by bad hand- 
writing or spelling, by some trick of style which jars on him, or by 
some personal predilection regarding the subject-matter. All marking 
which cannot be done objectively should consist largely in intro- 
spection: “Why does this strike me as good or bad? Are these the 
qualities which I am supposed to be judging, and are they the qualities 
which I took into account when I selected my standard specimens?" 

Many questions may advantageously be marked both by the 
qualitative and by the objective (count) systems, the figures then 
being combined in due proportion. If two or more examiners mark 
the same, or even a few of the same, scripts, it will be particularly 
valuable to analyse the reasons for any discrepancies in marks, so as 
to clarify still further their criteria of good and bad. 

One other source of prejudice often arises in marking if the 
examiner is personally acquainted with some, or all, of the examinees. 
Either he should carefully avoid looking at the names on the papers, 
or, if he cannot help recognizing their writing or style, should do his 
best to disregard any previous impressions he may have formed 
about them. 

6. With the statistical aspects of marking we have already dealt 
in Chaps. II-IV. But the following main points in subjective mark- 
ing (as distinct from count scoring) may be recalled. Marking is first 
and foremost the arranging of scripts in order of merit, or grouping 
them into categories of merit. It is not an attempt to give an absolute 
judgment of the value of each script. All the answers to one question 
should be classified into such categories before starting on another 
question. The categories should be, approximately, evenly spaced, 
and should not exceed ten in number. The frequency distribution of 
marks for each question should tend to the normal shape, and the 
average mark should be very close to 5 (assuming the middle category 
of merit to have been labelled 5). The dispersion of the distribution 
should also be approximately constant for all questions, 
unless it is desired to weight some more heavily than others. The 
final total scores for the scripts may be converted into an appropriate 
distribution of percentage marks by any of the three methods given 
in Chap. IV. All decisions as to standards of passing and failing should 
be postponed until the process of marking has been completed. 
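
The requirement of a constant dispersion can be met arithmetically where the marker's raw categories have not achieved it. The following sketch is one possible way and not the procedure of Chap. IV: it rescales each question to a mean of 5 and a common standard deviation, and treats a weight as a multiplier of that standard deviation. The target standard deviation of 2 and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def equalise_question(marks, target_mean=5.0, target_sd=2.0, weight=1.0):
    """Rescale one question's marks to a common mean and dispersion.

    A weight above 1 widens the dispersion, so that the question counts
    for more in the total.
    """
    m, sd = mean(marks), pstdev(marks)
    if sd == 0:
        return [target_mean for _ in marks]
    spread = target_sd * weight
    return [target_mean + (x - m) * spread / sd for x in marks]


def total_scores(questions, weights=None):
    """Add up the rescaled marks, script by script."""
    weights = weights or [1.0] * len(questions)
    rescaled = [equalise_question(q, weight=w) for q, w in zip(questions, weights)]
    return [sum(column) for column in zip(*rescaled)]


# Two questions marked on the same three scripts, the first spread
# twice as widely as the second.
questions = [[3, 5, 7], [4, 5, 6]]
print([round(t, 2) for t in total_scores(questions)])
```

Without some such adjustment the question with the wider spread of raw marks would, in effect, carry the greater weight in the total.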

The final step in many examinations is to reduce the marks to two, 
three, or more categories (Pass or Fail; First, Second, Third Class, 
etc.). Here it is most desirable that all scripts which fall within a 
few per cent. above or below any important borderline should be 
read a second time, preferably by more than one examiner, so as to 
ensure that the eventual decision shall attain the utmost reliability 
of which fallible human judgment is capable. 
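
The selection of scripts for a second reading can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: the margin of 3 per cent., the borderlines at 40, 60, and 70, and the candidate labels are all hypothetical.

```python
def needs_second_reading(mark, borderlines, margin=3.0):
    """True if the mark lies within `margin` per cent. of any borderline."""
    return any(abs(mark - b) <= margin for b in borderlines)


def borderline_scripts(marks, borderlines, margin=3.0):
    """Return the candidates whose scripts should be read again."""
    return [
        candidate
        for candidate, mark in marks.items()
        if needs_second_reading(mark, borderlines, margin)
    ]


# Hypothetical final percentage marks and class borderlines.
marks = {"A": 38.0, "B": 41.5, "C": 55.0, "D": 59.0, "E": 72.0}
borderlines = [40.0, 60.0, 70.0]
flagged = borderline_scripts(marks, borderlines)
print(flagged)
```

Only candidate C, whose mark lies five points from the nearest borderline, escapes a second reading in this example.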
Nelson-Denny Reading test, 75, 81-2, 83 
New York College Entrance Board, 256 
Northern Test of Educability, 184 


Northumberland Standardised tests, 
176, 178, 182 


Oakley, C. A., 262 
Oden, M. кер A "> 
O'Gorman, M., 68, 264 
Oldham, H. W., 138, 262 
One minute tests, 21, 173, 174 
Otis, A. S., intelligence tests, 182-3 
Parry, J. B. 
Passalong test 
Pear, T. H., 186, 265 
Peel, E. A., interests test, 187 
practical ability tests, 185 
Penrose, L. S., Pattern Perception test, 
182, 262 

Percival, T. S., French tests, 178 

Piaget, J., 156, 262 

Picture Frustration test, 187 

Pidgeon, D. A., intelligence test, 182 

Porteus, S. D., Maze test, 181, 193, 
262 

Primary Mental Abilities (P.M.A.) 
tests, 144, 185 

Progressive Matrices, 179, 182 

Pullias, E. V., 127-8, 262 


Raven, J. C., Progressive Matrices test, 
179-80, 182, 262 
projection test, 187, 262 
vocabulary scales, 174, 179 
Rhodes, E. C., 27, 203-5, 247, 261 
Richards, M. K. B., clerical test, 185 
Richardson, C. A., intelligence tests, 
183 


Rinsland, H. D., 243, 263 

Rosenzweig, S., Picture Frustration 
test, 187 

Rothney, J. W. M., 68, 260 

Ruch, G. M., 221, 243, 263 


Saffir, M., 260 
Schonell, F. E., 263 
English test, 177 
Schonell, F. J., 154n., 158, 263 
educational tests, 155, 173-5, 177 
intelligence tests, 183, 263 
personality schedules, 187 
Scottish Council for Research in 
Education, 70n., 159, 209, 250, 263 
Seguin-Goddard Formboard, 181 
Seven Plus Assessment tests, 175, 178 
Simplex Intelligence test, 183 
Simplex Junior tests, 183 
Sims, V. M., 242, 263 
Slater, P., personality tests, 186-7, 259 
Sleight, G. F., intelligence test, 182 
Smith, I. M., mathematics test, 177 
spatial test, 184 


Southend Arithmetic test, 178 
intelligence test, 182 
Spearman, C., 104, 127, 129, 137, 
И 147, 150, 152, 167, 168, 


2 
Stanford-Binet tests, 72-3, 112-18, 152, 
159, 179; see also Terman-Merrill 
tests 


Starch, D., 203, 263 
Steel, J. H., 21, 263 

Stephenson, W., intelligence tests, 182 
Summerfield, A., 247, 261 
Sutcliffe, A., 186, 263 
Sutton Booklet, 186 
Symonds, Р. M., 228, 262 
Talman, J., 21, 263 
Terman, L. M., 152, 157, 159, 192, 
263 
Terman-Merrill test, 21, 22, 23, 68, 72, 
162-6, 167-9, 179 
Thanet tests, 176, 178, 182 
Thomson, G. H., 66, 123, 137n., 139, 
141, 143, 145n., 147, 173, 174, 263 
intelligence tests, 183 
Thorndike, E. L., 23 
Thorndike, R. L., 123, 128, 263 
Thouless, R. H., v, 74n., 121n., 129, 
263-4 
Thurstone, L. L., 23, 141, 143-4, 147, 
260, 264 
primary abilities tests, 185 
Tomlinson, T. P., intelligence tests, 
181, 184 
Tulchin, S. H., 164, 264 


Valentine, C. W., 208-9, 251, 264 
intelligence and reasoning tests, 160, 
180, 186 
Vernon, H. M., 18n., 264 
Vernon, P. E., 19n., 68, 132, 135, 139, 
144n., 147, 151, 157, 161, 180, 181, 
186, 187, 193, 202, 247, 248, 264 
educational tests, 173, 177 
Study of Values, 187 
Vincent, D. F., mechanical test, 185 
Vineland Social Maturity Scale, 187 


WAIS, see Wechsler 

Walker, A. S., 187, 251, 264 

Walton, R. D., Geometry test, 177 

Watts, A. F., educational and language 

tests, 174, 176, 264 

performance test, 181 
spatial test, 184 

Wechsler, D., intelligence tests, 68, 72, 


133, 157, 160, 161, 162, 164, 168, 
170, 179, 194, 264 


Weiss Long, A. E., 186, 265 
West Riding intelligence test, 184 
West Yorkshire intelligence test, 184 
Whilde, N. C., 154n., 265 
Whipple, G. M., 186, 265 
Wilmer, E. N., 186, 265 
Wilmut, F. S., 209, 260 
Wing, H. D., Music tests, 186, 265 
WISC, see Wechsler 
Wiseman, S., 205, 265 

intelligence test, 184 

interest test, 187 
Wood, B. P., 203, 265 


Yates, F., 85, 261 
Yule, G. U., 107 


INDEX OF SUBJECTS 
Ability, discriminable grades of, 31, 34, 
65-6 


factorial analysis of, 112, Ch. VIII, 
152 


goodness and poorness of, 24-5, 
32-3, 53, 56, 58, 62-5, 67, 150 
measurement by sampling, 20, 87, 
127, 131-2, 152-3, 200-2, 212, 227, 
229, 255 
methods of scoring, Ch. II 
nature of, 19, 130-2 
numerical evaluation of, 18-19, 27, 
28-31, 33-5, 56-65, 149, 203 
qualitative grading of, 11, 26-8, 32, 
34-5, 205, 257 
rate and power plans of scoring, 21, 
22-3, 155 
units in measurement of, 3, 21-5, 30, 
59-60, 65, 70-1, 194 
Accident proneness, 17-18 
Accrediting, 252 
Achievement tests, see Educational 
tests 
Age, allowances, 53, 195n. 
determination of average, 39, 43-4 
factor in correlations, 120-2, 128, 
131, 134 
growth of abilities with, 69-71, 135, 
152, 157n., 209-10 
ranges of tests, 159, 160, 162, 171-2, 
188-9 
tabulation of, 5 
American examinations, 198, 252, 256 
new-type, 211 
American tests, 170, 172 
British performance on, 67, 75, 81, 
149, 159, 160 
construction of, 189-90 
for secondary schools, 155 
prognostic, 250 
purchase of, 170 
Analysis of variance, 85-7 
Analytic schemes of marking, see 
Marking 
Arithmetic ability, group factor of, 
131-2, 126, 137, 138, 143 
measurement of, 19-20, 25, 132, 135, 
207, 211, 255 
new-type examinations of, 232, 240 
range of, 63, 65-6 
tests of, 21, 153, 154, 164, 174, 177-8 
Arithmetic Age, 69 
Attainment tests, see Educational tests 
Attenuation, 129, 134n., 209 
Attitude tests, 19, 187 
Australian tests, 170-2 
university entrance, 249, 252 
Average, see Mean 
Average Deviation (A.D.), see Mean 
Variation 


8 3 2 

Bi-factor analysis, 140-1, 147, 15 

Bimodal distribution, see Frequency 
distributions 


Character traits, factor analysis of, 1! 
tests and ratings of, 13, 19-20, w 
126, 144, 148, 154, 186-7, 197-8, 
iens f tests in, 39, 145 
Child guidance, use of tests in, 57, 
147, 154, 161, 164-5, 168-9, 192, 
193, 208 
Children, effects of examinations on, 
199, 246, 250 
with language handicaps, testing of, 
158 
young, testing of, 158, 159, 160-1, 
162, 164, 168-9, 191 
Children's attitudes, to marks, 22 


to new-type examinations, 213 
to tests, 158, 161, 164-5, 169, 191 
Chi-squared (χ²), 6n., 88-91 
Civil Service, selection for, 135, 247 
Clerical ability tests, 185 
Coarse grouping, effects on correla- 
tions, 98, 101 
on means, 42 
on standard deviations, 100 
Coefficient of Association, 107-8 
Colligation coefficient, 107 
Colour blindness, tests of, 186 
Column diagram, see Histogram 
Combination of tests, see Marks and 
scores 
Communality, 137, 142n. 
Comparison of abilities, see Marks and 
scores 
Compensation theory, 130, 133-4 
Concept formation, 146, 157 
Confidence limits, 78, 80, 83, 110 
Contingency coefficient (C), 104-7, 


Correlation, 
118-29 


and 
Correlation (cont.) 

biserial, 108-9 

combining and contrasting coeffi- 
cients of, 110-11 

effect of, upon reliability formulae, 
82, 84-5 

forecasting efficiency, 112, 118-20, 
123, 209 

from regression lines, 115 

from small populations, 76, 109, 110 

in homogeneous populations, 122-3, 
128, 139, 201, 205, 249 

influence on, of irrelevant factors, 
120-3, 131 

interpretation of, 118-23 

inverse, see negative 

Kendall’s method, 104 

multiple, 124—5, 145n. 

nature and objects of, 2, 92-4, 112, 

1 


negative, 94, 107, 132, 134 
partial, 121-2, 131, 136, 139 
Pearson-Bravais method, 94 
Probable Errors of, see Reliability 
product moment methods (r), 92, 
94-102, 108, 109, 115 
rank order method (ρ), 26n., 92, 
103-4, 110 
reliability of, q.v. 
гё, 106-7 
Spearman foot-rule method, 104 
tetrachoric (т), 108 
z technique, 110-11 
Correlation Ratio (η), 104-5 
Count scores, 3, 20-5, 28-30, 212, 257 
Critical Ratio (C.R.), 80, 82, 85 
Cross-classification of pupils, 92, 136 
Cumulative frequency graph, 7n. 
Cumulative record cards, 163, 187, 
250-1, 252 


Deciles, 37-8, 67 

Degrees of Freedom, 85n., 89n., 90n. 
Developmental Quotient, 157n. 
Diagnostic tests, 154, 170, 174, 177, 


Dictation, see Spelling 

Dispersion of measures, 1, 16, 36, 40, 
45, 53-4, 62-3, 65-6, 81, 87, 123-4 
indices of, see Mean Variation, 
Range, Standard Deviation 

Distractions in testing, 191, 192-3 

— ability, tests, 135, 136, 155, 


Educational ability, general factor of, 
137, 139 

Educational Age (E.A.), 69, 70-1, 171, 
188-9, 194—5 


Educational backwardness, 71, 121, 
158, 161-2 

Educational guidance, 3, 31n., 68, 
144—5, 154, 163, 167, 170, 194-6, 
198, 248-9, 251 

Educational objectives, 198, 253 

Educational Quotient (E.Q.), 71, 157-8, 


194 
Educational tests, 21, 31n., 153, 154-5, 
163, 173-8, 227 


choice of, 153, 188-9 
construction of, 153, 189 
correlations between, 134-5 
functions of, 154-5 
norms, 69-73, 154 
reliability of, 126, 127-8, 134n., 150, 
227 
validity of, 150-1, 228, 250 
See also Arithmetic, Reading, etc. 
English ability, factorial analysis of, 
136, 139, 142 
tests of, 75, 153, 154, 157, 176-7, 
234, 242, 250 
See also English composition, Read- 
ing 
English composition, marking of, 13, 
21, 25, 27, 28, 29, 31, 34-5, 201-7, 
211, 256 
range of ability in, 65-6 
subjects for, 64, 149, 256 
tests of, 135, 155, 175 
training in, 199, 224 
variations in ability at, 63, 201–2, 205 
Examinations, age allowances in, 53, 
195n. 
as commercial qualifications, 197-8, 


as pedagogical aids, 32, 198 

coaching for, 199, 208, 210, 212, 223, 
250, 252 

criticisms of, 148-51, 154, Ch. XI, 
223, 226, 250, 252, 254 

ease or difficulty of, 24-5, 33, 63-6, 
230-1, 245, 246, 254, 256 

E of, 154, 197-8, 208, 246, 

improvements in, 253-8 

marking of, 206-7, 220, 227-8, 
256—8; see also Marking of scho- 
lastic work 

new-type, q.v. 

of large numbers of candidates, 53, 

of small numbers of candidates, 35, 
58, 66, 252, 255 

oral, 246-7 

passmarks in, 29, 32, 35, 201, 206, 
212, 258 

reliability of, 2, 128, 200-8, 210, 223, 
227-8, 247, 250, 255-6, 258 
Examinations (cont.) 
setting questions for, 64, 148, 151, 
220, 227, 253-6 
substitutes for, 246-52 
validity of, 2, 87-8, 151, 208-10, 
2 


Examinations, Baccalaureat, 205-6 

Civil Service, 135, 202, 247 

General Certificate, 2, 135, 197, 
198-9, 202, 206, 210, 250, 252 

Leaving Certificate (Scottish), 198, 
209, 210, 250 

Matriculation, 2, 197, 210 

Qualifying (Scottish), 31, 71 

Scholarship, 108, 206, 208-10, 228 

School Certificate, 27, 203-4 

School (internal), 53-66, 197, 252 


Secondary School selection, q.v. 
Special Place, 203, 208 
Training College, 29-30, 58, 135, 

University, 30, 35, 58, 62-6, 148, 
197-8, 201, 202, 203-4, 209, 


F-ratio, 87 
Factor patterns, 138, 140-2, 144, 146, 
147, 152 


effect of heterogeneity on, 139 
effect of tuition on, 138, 144 
Factorial analysis, Ch. VIII 
limitations of, 138-40, 146-7 
techniques of, 136—7, 139, 143, 147 
uses of, 20, 130, 135, 138, 144-6, 
152, 168-70, 224 
Factors, common, 120, 137, 141 
general, 131, 136, 137n., 139, 140-4, 
152-3 
group, 131, 136-45, 147, 151, 152; 
161, 169, 195, 226, 250 
multiple, 140-1, 143-4, 146, 147, 152 
Er E 140-2, 144, 145, 150, 
151,1 
Faculties, 20, 130-2, 138, 143, 145, 224, 
2: 


Formal discipline, 199, 253 
Free word association tests, 39 
French, tests of, 178 


Frequency distributions, asymmetrical, 


6n., 15, 17, 31, 104 
bimodal, 15 . 
combining and comparing, 53-62, 
123-4 


curves, area of, 11, 16 

definition of, 4 

dissimilarities in, 54 

graphical representation of, 7-11, 
12-17 
Frequency distributions (cont.) 
ГЕРА 13-15, 18, 24, 29-30, 35, 
90-1, 104 RETEN 
normal, see Normal distribution 
of classified measures, 5-7 
of correlations, 76 
of differences, 76-80 
of means, 1773: 8978 
rectangular, 26, 
skew, 15, 17-18, 23-4, 29, 30, 31-2, 
39, 54, 57 а 
statistical methods, for handling, 
Ch. III, 54-8 
tabulation оз 7. ТТЕР 
Frequency polygon, 7-11, 13- 
Function fluctuation, 128-9, 137n., 200, 
209-10, 227 


g factor, 142-6, 152, 161, 168, 169 
Gaussian distribution, see Normal 
distribution 
Goodness of fit, 50-1, "NS 
Group intelligence tests, application 
of, 68, 162-3, 164, 167, 190-2, 
249-50 
batteries, 142, 156 
choice of, 153, 188-9 
classified list of, 153, 181-4 
compared with individual, 162-6, 
167-70 
ans icon об 152-5; 155-6,188-9 
differential, 169— 
distributions of scores on, 10-16, 32, 
50-1, 81, 83-4, 90-1, 188-9 151 
effects of health and mood on, , 
162, 164—5, 169, 191 oe 
effects of practice and coaching on, 
127, 157n., 165-7, 191, 195 122 
scoring of, 156, 164, 189-90, 191- 
time limits, 132, 156, 163-4, 190 
validity of, 151-3, 167-9, 249 
See also Mental tests 


Handedness tests, 186 
Handwriting ability, distribution of, 17 
factorial analysis of, 136 
marking of, 25, 29, 34 
tests of, 39, 135, 155, 175 
Heterogeneity of populations, 122-3, 

128, 135, 139, 201, 204n., 205, 209, 

Histogram, 7-11, 13 

Honours students, superior ability of, 
16, 81-2 


efinition and and, 69-71, 94, 249 


d 
146, 152, 1562710 wa он 
Intelligence (cont.) 
educational attainments and, 69, 92, 
108, 122, 125, 139, 144, 146, 153, 
156-8, 164, 168-9, 194, 210, 
249-50 
hereditary and environmental influ- 
ences on, 68, 156-8, 167, 249 
language handicaps and, 158 
nationality and, 68, 158 
physique and, 134 
sex and, 16, 76-80, 157n. 
subjective judgments of, 139, 148-50, 
151-2, 196 
vocation and, 144-5, 146, 157, 195 
Intelligence Quotients (I.Q.), distribu- 
tion of, 15-19, 49-50, 52, 72-3, 195 
interpretation of, 19, 22, 36, 71-3, 
193-6 
prediction of, 112-18 
stability of, 126-7, 156-7, 165, 168, 195 
standard deviation of, 52, 71-3, 
157n., 195 
Intelligence tests, see Group, Non- 
verbal, Terman-Merrill, etc., tests 
Interests, factorial analysis of, 141, 147 
intellectual, 134, 139, 141, 144, 146, 
157, 246 
tests, 187 
Internal consistency, 152-3 
International Examinations enquiry, 
203, 205-6 
Interviews, 246-8 


K or k:m factor, 144, 161 


Language development tests, 174, 176 

Letter grades, 11, 26-8, 205 

Linguistic and literary abilities, see 
English, French, Reading, etc. 


Machine-scoring of tests, 189-90 
Manual abilities, factorial analysis of, 
132, 134, 136, 137, 142, 144 

tests of, 132, 135, 186 

Marking of scholastic work, analytic, 

28, 203-4, 207, 211-12, 223, 256-7 

by ranking, 25-6, 27, 28, 30, 32, 58, 
59, 65, 103, 257 

improvements in, 31—5, 58, 59-62, 
256-8 

objective vs. subjective, 11-12, 20-1, 
28-30, 32-3, 35, 65-6, 149, 155, 202-3, 
206-7, 211-12, 226-8, 255, 257 

standards of, 27-8, 33-5, 53, 58, 
63-5, 86, 150, 200, 202-3, 205, 
251-2 

teachers' idiosyncrasies in, 6, 27, 29, 
33, 62-4, 202, 257 

See also Ability, Arithmetic, English 
composition, Examinations 
Marks and scores, combination of, 22, 

26, 28, 53-6, 59, 123-5, 204 

comparisons of, 2, 32, 53-4, 56-8, 
59-60, 69 

distributions of, 4-7, 11, 24, 26, 
28-30, 32-5, 48, 53-8, 59-62, 86, 
257 

scaling of, 33, 35, 56-62, 124, 202, 
230, 251-2, 257-8 

significance of, 24-5, 27-30, 33, 36, 
39-40, Ch. IV, 150, 210 

weighting of, 54-5, 66, 123, 125, 244, 
257 


Mathematics tests, 177, 185, 250 
Mean, arbitrary or guessed, 42-5, 
46-8, 85, 95 
Mean, arithmetic, calculation of, 1, 6n., 
38-9, 40-5, 48 
effect of, in combining marks, 54-6 
S.E. of, 82-3 
Mean Variation (M.V.), 45, 48 
Measures, definition of, 3 
Mechanical ability (m-factor), 130-1, 
144, 145 
tests of, 184 
Median, 36-9, 45, 58 
Memory, 130, 132, 142, 143, 145, 168, 
193, 253 
Mental Age (M.A.), 23, 69, 70-1, 156, 
159, 160, 161, 162, 171-2, 188-9, 
194-5 
Mental deterioration, 145-6, 170 
Mental measurement, 19-20, 194; see 
also Ability 
contrasted with physical, 19, 21-4, 
30, 194, 196, 206 
Mental organisation, 130, 132, 138, 
143, 146-7, 224-5 
Mental speed, 130, 132, 144, 146, 
163-4 
Mental testers, desirable qualities in, 
191— 
training needed, 73, 153, 159, 160, 
162, 164, 190, 193, 196 
Mental tests, classified list of, 153-4, 
170-87 
construction of, 20, 23, 143, 148-53 
further developments of, 169-70, 250 
instructions, 148, 149, 159, 189-90 
inter-relations of, 92, 94, 129, 130-5; 
see also Factorial analysis 
limitations of, 67-73, 126-9, 149, 
151, 152, 159-60, 161-70, Ch. X, 
249-50 
norms, 25, 26, 28, 40, 55, 58, 66-73, 
150, 160, 161, 167, 171-2, 189, 
190, 191-2, 195 
origins of, 148 
prognostic, 151, 250 
publishers of, 171 
Mental tests (cont.) 
reliability of, 2, 87, 121n.,126-9, 131, 
137, 150, 164, 165, 168-9, 188, 
194-5 
scoring of, 149-50, 159-60, 164, 
189-90, 191-2 
standardisation of, 25, 67-8, 69-70, 
148-50, 153, 159-60, 195 
validity of, 2, 87-8, 118-19, 123-5, 
129, 150-3, 167-9, 188, 249-50 
Multiple factor analysis, 140-1, 143-4, 
147 


ы abilities, factorial analysis of, 
4 
tests of, 186 


New-type examinations, advantages 
over essay-type, 32, 211-13, 220, 
223-4 


applicability of, 20-1, 35, 66, 219-20, 
228, 239, 243, 255 

best-reason questions, 237-8, 239 

completion questions, 214, 228, 231, 
232-4, 244 

defects of, 151, 211, 219-28, 232-3, 
235-6, 240-1 

distributions of marks on, 29-30, 
32-3, 212, 220-1, 227, 230-1 

elementaristic tendency in, 223-5, 
226, 228, 235, 255 

guessing factor in, 220-2, 232, 235, 
236, 237, 244, 258 

irrelevant clues in, 225-6, 233-4, 
236, 240-1, 244 

matching questions, 218-19, 231, 
238-42, 244 

multiple-choice questions, 215-17, 
220, 222, 225, 228, 231, 234n., 235, 
236-42, 243, 244 

practical, 243 

principles of construction, 32-3, 153, 
пт 220, 224-6, 221-8, Ch. 

I 


rearrangement questions, 219, 242-3 
reliability of, 210, 221-2, 226-8 
scoring of, 220, 234, 242-3, 244 
simple recall questions, 212, 213-14, 

221, 226, 231, 232-4, 239-40, 244 
specimen paper, 213-9, 258 
suggestive effect of false statements 

in, 222 
true-false questions, 214-15, 220-2, 

228, 231, 234-6, 244 
validity of, 151, 225, 226, 228, 245 
weighting of marks, 244 

Non-verbal intelligence tests, classified 

list of, 153, 181-2 
factor analysis of, 144 
uses of, 158, 163, 250 
See also Performance tests 
Normal curve of error, see Normal 
distribution 
Normal distribution, algebraic equa- 
tion of, 49 
applicability to reliability, 76-80 


application to test construction, 23-4, 
31, 65, 188 


applications to school 


257 
definition of, 17 
disturbing factors in, 17-18 
examples of, 17 
ordinates of, 108 
testing conformity to, 50-1, 90-1 
Norms, see Mental tests 
Null hypothesis, 88, 89, 105 


Numerical description of distributions, 


36-7, 39-40, 49, 56 
Numerical evaluation of abilities, see 
Ability 


Ogive graph, 7n. 

Omnibus tests, 156, 190 

One-tail and two-tail significance tests, 
79n. 

Oral group tests, 163, 181 


Passmarks, see Examinations 
Pearson-Bravais method, see Correla- 
tion 
Percentage marking, defects of, 28; 
see also Ability, numerical evalua- 
tion of 
Percentages, S.E. of, 84, 86, 90 
Percentile graphs, 56- 
Percentiles, 36-7, 38-40, 49, 58, 156, 
171, 189, 194 
comparison of distributions by, 
56-61, 65 
sizes of units, 59, 67 кеа 156 
Performance tests, application of, 93 
158, 160-2, 169, 180-1, 189, 1 
classified list of, 153, 180-1 
construction of, 148-50 
factor analysis of, 144 
improvement on, with age, 70 
limitations of, 126, 161 
scoring of, 21, 149, 155, 161, 189 
Perseveration tests, 104 
Personality, study of, 141, 147, 161, 
169, 193-4, 246, 248; see also 
Character, Temperament 
tests, 154, 186-9 


hysical abilities, distributi 

dist 17 
Measurement of др " 
tests of, 132, 18 

nions, distribution of, 
Scaling of, 19 


marking, 
17-18, 24, 26, 31-5, 55, 59-62, 205, 
Practical ability, 134, 136, 144, 151, 
161, 168, 193 
factor (F), 144n., 161 
tests of, 187 
Practical examinations in science, 243 
кашса teaching ability, 136, 146, 
9 


Practice, effects of, on intelligence 
tests, 127, 157n., 165-7, 191, 195 

Practice sheets, 156, 189, 191 

Probability integral tables and graph, 
49-52, 59, 60-2, 77, 79 

Probability theory, 51-2, 63-4, 77-80, 
109, 221 

Probable Error (P.E.), 78-80, 87n., 
109, 204; see also Reliability 

Product-moment method, see 
Correlation 

Product scales, see Quality scales 

Projective techniques, 187 

Pupils' records, see Cumulative record 
cards 




Quality scales, 34, 149n., 154-5, 175, 
178, 254 

Quartiles (Q1 and Q3), 36-8, 40, 54, 56, 
58, 79 


Quota scheme, 251 


Range of measures, 1, 6, 36, 40, 45, 48, 
51, 55, 66; see also Dispersion 

Ranking of measures, 4, 103-4, 195; 
see also Marking 

Rank-order method, see Correlation 

Rapport in testing, 165, 191-3 

Rate and power plans, see Ability 

Reading ability, factorial analysis of, 
134-6 


marking of, 29, 56 
tests of, 21, 22, 75, 135, 151, 153, 
173-5 
Reading Age, 69 
Reasoning abilities, 143, 146, 157-8, 
223-5, 253 
tests of, 179, 186 
Recognition type items, 144, 167, 169, 
212, 222, 225-6, 232, 234-43 
Regression, 113-16, 123, 133 
equations, 115-16 
multiple, 125 
non-linear, 104-5 
Reliability, coefficients of, 126, 129 
effects of function fluctuation, 128-9 
effects of heterogeneity, 128 
effects of length of test, 126, 127-8 
formulae for, 81-5, 109-10, 117, 
126-8 
of correlations, 74, 76, 81, 87, 
109-10 
Reliability (cont.) 
of differences, 2, 74-5, 76-82, 84-5, 
86 


of means, 74-5, 81, 82-3, 84 

of percentages, 84, 86, 90 

of predicted scores, 87-8, 117-19, 
123, 126-7, 157 

of scores, 87, 126-7, 194 

of small samples, 85 

of standard deviations, 83-4 

of z, 110 

theory of, 74-80 

See also Examinations, Mental tests 


Sample test items, 156, 189, 191, 244 
Samples of the population, unselected, 
17, 67-70, 74-5, 80, 122-3 
Sampling errors, 63, Ch. V, 105, 
109-10, 126-7, 128-9, 137n., 138, 
142, 195, 200-1 
Sampling of ability, see Ability 
Scaling, see Marks and scores 
Scattergram, 92-4, 98, 101, 104, 105, 
112-4, 117, 133-4 
Scatter of measures, see Dispersion 
Scholastic abilities, factorial analysis 
of, 92, 134-6, 142, 144 
tests of, see Educational, Arithmetic, 
etc., tests 
Scientific approach to psychological 
problems, 3, 74, 76, 78, 86, 130 
Scotland, test norms in, 68 
Secondary education, selection for, 
31, 71, 92, 119-20, 122, 124-5, 
154, 163, 164, 165—7, 195n., 197, 
198, 199, 205, 209, 223, 224, 228 
Semi-interquartile range (Q), 40, 45, 
79 
Sensory-motor tests, 186 
Sex differences in ability, 2, 15-16, 75, 
76-81, 83-4, 157n. 
in cinema attendance, 89-90 
in milk-drinking, 84 
Simple structure, 141 
Skew distribution, see Frequency 
distribution 
Small sample techniques, 85 
Spatial judgment (k-factor), 143, 144, 
1 


tests of, 184 
Spearman-Brown prophecy formula, 
127, 201 
Speed factor in ability, 130, 132, 144, 
146, 163-4, 168, 169 
Spelling ability, factorial analysis of, 
134-6 


measurement of, 22, 25, 26 
tests of, 21, 135, 175, 234 
Spread of measures, see Dispersion 
Standard Deviation (S.D. or σ), 36, 40, 
45-52, 87 
differences between, 83 
of averaged measures, 123-4 
use of, in combining marks, 54-6, 
59-62, 65 
Standard Error (S.E.), 78, 87n., 88; 
see also Reliability 
Standardisation, see Mental tests 
Standard scores, 55-6, 59-62, 65, 67, 
69, 73, 115, 123, 156, 171, 194 
Standards of marking, see Marking of 
scholastic work 
Statistical methods, definition of, 1, 2-3 
limitations of, 2-3, 80, 138-9, 146-7 
objects of, 1-3 
Statistical significance, 79-80; see also 
Reliability 


1, 85, 87, 110 
Tabulated measures, average of, 40-5 
correlation between, 98-102 
percentiles of, 38 
standard deviation of, 47-8 
Tabulation of measures, 1, 3-7 
Television viewing, 105-7 
Temperament tests, 19, 126, 161, 186-7 
factor analysis of, 141, 144, 146 
Tetrad difference criterion, 142 
Time limits in testing, 20-1, 132, 156, 
163-4, 171, 190 
Training College students, abilities of, 
39-40, 75-6, 80, 86, 135, 136, 139, 
155, 249 
Trustworthiness of statistical results, 
see Reliability 
Two-factor theory, 141-5 
Unitary factor patterns, 141-2, 147, 
152 


United States, see American 

University education, selection of 
students for, 2, 122, 135, 197, 198, 
208-10, 228, 247, 248-9, 252 


Validity, see Examinations, Mental tests 
Variability, see Dispersion, Reliability 
of attainments, 1-2, 17, 62-3, 129, 
131, 135, 200-2 

Variables, categoric, 88-90, 104-7, 108 
definition of, 3 
discrete and continuous, 3, 5, 50 
traits and abilities as, 19 

Variance, 87 

Verbal ability (v or v:ed factor), 143, 
144, 145, 168, 193; see also 
English ability 
Verbal tests, see Group intelligence 
tests 

Viva voce, see Examinations, oral 

Vocabulary, see Language develop- 
ment, Stanford-Binet, etc. 

Vocational abilities, factor analysis of, 
136-7, 145 

Vocational guidance, 144-5, 154, 163, 
170, 248 

Vocational selection, 119-20, 125, 135, 
145n., 161, 210, 247-8 

Vocational tests, 55, 118-19, 125, 126, 
151, 154, 184-6, 189 


War Office Selection Boards, 248 
Weighting, see Marks and scores 


X-factor, 144 
z-techniques, 110-11 
cover a different part of the ground which we wish to pre- 
dict. It has often been claimed that selections based on 
such tests would be fairer than selections based on examina- 
tions, since they eliminate the factors of coaching and cram- 
ming, and reveal the pupils’ real abilities rather than their 
schooling. Unfortunately, as shown above (pp. 191-2), 
coaching for intelligence tests is only too easy and too 
prevalent. Nevertheless, unpublished tests are available 
which can usefully supplement, though hardly replace, 
examinations. Beyond the age of 15 or so, the significance 
of intelligence tests is so doubtful that they would probably 
be better applied by no one except trained vocational and 
educational psychologists. 
Cumulative Record Cards—The primary function of 
many examinations is to provide a certificate of work done. 
Ostensibly this is true of the School and Leaving Certifi- 
cates, and of university degree examinations, though they 
are also often interpreted prognostically. Now, in most 
schools the teachers know quite well what work each pupil 
has done. Hence there is every reason for making more 
use of teachers’ records of this work and less use of external 
examinations. The former cover the whole ground, the 
latter never more than a fraction of it. The former are 
collected under everyday conditions, the latter when many 
pupils are in an abnormal emotional state due to cramming 
and fear of failure. The former are also less likely to pro- 
duce the degrading effects upon teachers’ and pupils’ 
educational values which are characteristic of public com- 
petitive examinations. Teachers’ markings of school work 
are, of course, just as liable as examiners’ marks to 
subjectivity and unreliability; perhaps more so, because 
of the prejudices they are likely to possess regarding each 
pupil’s character and ability, which may influence their 
judgment of his written work. But if in the course of his 
career a pupil is taught by several different persons, each 
of whom records his achievements according to some 
uniform system, the cumulative record should yield a fairly 
reliable picture. Although, as stated above, the function 
of such records should be to certify work done rather than 
to predict future promise, yet the investigation by the 
Scottish Council for Research in Education (1936) into the 
Leaving Certificate showed that teachers’ marks did in fact 
correlate with subsequent university success just about as 
well as did the external examiners' marks. The many other 
uses to which cumulative records may be put, such as 
vocational and educational guidance, have been fully 
described by Hamley and his collaborators (1937). 

The objection which will at once be raised to such 
schemes is—how are the standards of different schools to 
be compared ? The best pupil from one school might, in 
an open competitive examination, be found below the 
average pupils of certain other schools. Two solutions will 
at least partially meet this difficulty. Valentine (1938) 
proposes that a general intelligence test should be 
applied to all the schools under one authority. From this 
may be deduced the average level and range of brightness 
in each primary school. The available places in secondary 
schools may then be apportioned to the primary schools, 
the greatest number being awarded to those schools known 
to possess the largest quantity of bright pupils. Finally, 
each head master may allot these places to the best pupils 
in his own school, on the basis of their intelligence test 
results and their cumulative records of work. An alterna- 
tive scheme would be for the secondary schools to tabulate 
for a considerable period the final marks which the pupils 
had received at their respective primary schools. They 
would also record the actual performance of these pupils 
during their secondary school careers. From these data it 
would soon be possible to determine the relative standards 
of all the primary schools from which the pupils were 
drawn. It would be found, say, that the pupils with a given mark 
from school A were likely to be just as good as the pupils 
with 80% from school B. Hence the numbers of pupils worthy of 
selection from each school could be deduced. This method 
is already widely used at the university level in America. 
Either of these methods would, of course, require a com- 
petent statistician to run it. They would not, as they 
stand, be appropriate as substitutes for the School or 
Leaving Certificates or Matriculation examinations. Yet 
in these instances, too, the relative levels of all schools 
which submitted fairly large numbers of candidates might 
be established by means of brief, scientifically conducted 
attainment and intelligence tests. On the basis of these 
results each school could be informed as to the number of 
certificates to which it was entitled. And the respective 
head masters would be left to decide, from school records, 
which pupils should be given these certificates. 
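
The apportionment of places or certificates among schools can be sketched as follows. This is a minimal illustration in Python, not the procedure laid down by Valentine or any examining body: it assumes that each school's share is proportional to its number of pupils above some cut-off on the common test, and it settles fractional places by the largest-remainder rule, which is our own assumption. The school names, counts, and number of places are hypothetical.

```python
def apportion_places(bright_counts, places):
    """Share `places` among schools in proportion to their bright pupils.

    bright_counts maps each school to the number of its pupils above the
    chosen cut-off on the common test. Fractional places are settled by
    the largest-remainder rule (an assumption of this sketch).
    """
    total = sum(bright_counts.values())
    if total == 0:
        raise ValueError("no pupils above the cut-off in any school")
    exact = {s: places * n / total for s, n in bright_counts.items()}
    quota = {s: int(share) for s, share in exact.items()}
    leftover = places - sum(quota.values())
    # Hand the remaining places to the schools with the largest fractions.
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for school in by_remainder[:leftover]:
        quota[school] += 1
    return quota


# Hypothetical counts of bright pupils in three primary schools.
bright = {"North": 12, "South": 30, "East": 17}
quota = apportion_places(bright, 20)
print(quota)
```

Each head master would then fill his school's quota from his own records, as the text proposes.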


IMPROVEMENTS IN ORDINARY EXAMINATIONS 


Since external examinations will for a long time play a 
large, though we hope diminishing, part in education, and 
since even in an ideal system teachers must apply internal 
examinations, and mark their pupils’ work, we must con- 
sider carefully the improvements which might be made in 
the examination as it now is. From the investigations we 
have described, and from the discussion in previous chap- 
ters, the following principles may be deduced. 

1. The teacher or examiner should formulate as precisely 
as possible what it is that he wants to measure. This 
implies first a clear conception of the objectives or aims to 
which, in his view, education should be directed. At the 
primary school level he may be content with the view that 
education consists in implanting certain traditional know- 
ledge in children's minds, such as the way to work out 
sums and to parse sentences. This may be one objective 
in the secondary school and university also, though surely 
not one of the most important. History, science, foreign 
languages, and the like do involve the acquisition of a 
certain body of facts (dates, chemical formulae, meanings 
of words, etc.); but they are supposed to possess many 
additional values. Taste for literature, interest in foreign 
peoples, scientific habits of mind, ability to apply know- 
ledge to the interpretation of present-day events, are but a 
few of the things which we hope to inculcate. In making 
such an analysis, the teacher must be careful to take into 
account the facts which psychologists have established in 
ansfer'of training, namely that general facul- 
ties, like observation, memory, and reasoning, will not be 
developed merely by the study of some particular school 
subject, and that indeed these faculties have little real 
existence. Forexample, a rational and scientific approach 
to contemporary problems can be developed by studying 
typical problems rationally, by conducting discussions and 
projects, by showing how irrationalities of thinking arise, 
and so on; but not by the carrying out of stock experi- 
ments in physics, nor by memorizing the rules of grammar 
or logic. A Behavioristic formulation should be helpful : 
instead of talking in terms of vague mental qualities, we 
should ask—what can the educated pupil or student, do in 
each specific field of study, or in the world at large, which 
the uneducated one cannot do; and what are the precise 
elements of our teaching which are supposed to confer 
upon him these capacities? — S 

Having formulated as candidly as possible our general 
philosophy of education, we can ask which of these objec- 
tives are to be examined. We may decide that desirable 
moral conduct and social adjustment cannot be expressed 
in written work. But each objective should be considered 
from the point of view—what can the educated pupil write, 
or what problems can he solve, which the uneducated one 


и examiner, who has not been responsible for 
teaching the examinees, should adopt а similar procedure 
in deciding what his examination 1s meant to measure. 
He needs, however, to be especially careful to emphasize 
© can ? rather than * ought ’ in his analysis. It is merely 
futile for him to complain that teachers are not doing their 
work properly, and to insist that examinees must conform 
attempt to assess these qualities and 
ulative records cards (cf. Hamley, 


regard to tr 


er, 
1 Teachers should, however; 
record them on the pupils cum 


et al., 1937). 
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to intellectual standards which may be current among 
persons like himself, but which are quite Ss 
by children or immature students. His job is to set tas à 
which will differentiate the better pupils who exist in real, 
not ideal, schools, from poorer ones. If he is supposed to 
pick out, say, the best 80%, but formulates RUM 
which even the best 1% can hardly achieve, then he is no 
doing his work efficiently and is creating unnecessary 
difficulties for himself when the time for marking Аш 

A group of persons, such as a head master and his staff,
or a board of examiners, should be able to carry out such
analyses much more logically and thoroughly than can any
one person, however clearheaded he may be, since they
can criticize one another's ideas.

Finally, the conclusions of this analysis should be
communicated by the examiners to their prospective
examinees, or by the teachers to their pupils. At the present
time examining frequently degenerates into a battle
between the examiners who try to catch out their victims,
and examinees who try to find out what their examiners
really want, and what are their personal prejudices, by
conning the papers they have set in the past. All this is a
stupid waste of everybody’s time.

2. Having decided upon the abilities that he wishes to
measure, the teacher or examiner should consider the form
of examination which will be most appropriate for each of
them. French conversation [...] or a mixture. [...]
connection are the [...] own skill in [...]
numbers are very small, he may be justified in shirking the
expenditure of time required by objective examining.
And he should know from past experience that he is more
successful in devising new-type measures of some things
than of others.

Generally speaking, any kind of factual information, or
straightforward skills, such as the carrying out of arith- 
metical operations or grammatical analyses, should be 
examined in new-type form, instead of being left to the 
vagaries of subjective marking. This need not mean that 
only trivial and unrelated items of knowledge or skill 
should be included. Good questions, as we have seen, can 
bring out the examinees’ ability to interpret the significance 
of facts, grasp of general principles, and application of such 
principles to specific new problems. True, there are many 
dangers in these more elaborate items, but they are 
hardly as serious as the dangers of trying to discover all 
the underlying abilities of the examinees from their 
answers to essay questions. 

We admit, however, that many important aspects of 
ability may be displayed more clearly in essay-type answers 
or (with subjects such as mathematics and physics) in the 
working out of long and complex problems. Literary style 
of composition, imaginative qualities and originality, 
evidence of the examinee's interest in the subject and of 
his own thinking about it, are some of the things which the 
examiner may wish to elicit, even if he can never hope to 
measure them perfectly reliably. If the examination is 
designed primarily to bring out qualities such as these, 
then the examinees should be told so, and should be 
informed that amount and accuracy of information will 
not, in this instance, be taken into consideration. Frank 
directions about what is wanted are particularly desirable 
when the same examination includes some new-type and 
some essay-type questions.

3. If, for one reason or another, essay-type questions
have to be maids-of-all-work, then the examiner should
ask himself which parts of the total field to be examined
are the most important. Knowing that his questions can
sample only a fraction of it, he should avoid the less
essential parts, even at the risk of his questions being easily
‘spotted’ beforehand. Overlapping among the questions
is obviously undesirable also, since it is wasteful to cover
any one part twice over. With each question, the length
of time needed for a reasonable answer should be
considered, so that the examinees shall not be set too much,
nor too little, to do in the period allowed. Finally, 
although it may not be possible to increase the total 
examination time, or the number of papers, yet the 
examiner should keep in mind the unrepresentativeness of 
single papers, and the desirability of improving reliability 
by such increases. 

4. Perhaps the most important quality in a good ex- 
aminer is his ability to foresee the sort of answers he will 
get, both through his insight into the minds of pupils or 
students, and through his experience of their answers in 
the past. If in formulating each question he possesses a
clear conception of the probable answers (good, average,
and bad), the following advantages are likely to accrue.
The questions will be simply worded, free from ambiguities, 
and they will not be interpreted in such divers ways that 
the answers are too heterogeneous to be compared, or 
marked consistently. The questions will also be appropriate
in difficulty; not, that is to say, the type of questions
which no one but the examiner himself could answer 
adequately. Most important of all, the production of 
answers to these questions will really bring into play the 
abilities which the examination is aiming to measure. 
When, later on, the examiner reads and marks the answers, 
he should continually be trying to improve this ‘faculty,’
by noting how far his expectations are fulfilled. Every 
question may be regarded as an experiment in forecasting 
which, although never likely to be completely successful, 
may become progressively more so with practice. 

In this, as in all other branches of examining, two 
opinions are likely to be better than one, three than two. 
A board of four or five will perhaps be the most efficient. 

5. When essay-type questions call for answers of a
largely factual nature, a detailed scheme of marking should
be drawn up. This is especially necessary when several
different examiners each mark a certain proportion of the
papers. Experiments indicate that it will reduce their
variability by roughly one-half. But a single examiner 
will also greatly improve his consistency if he decides 
beforehand just what points to look for, and marks these 
objectively. 

In assessing the more general qualities which essay 
answers are supposed to bring out, the quality scale method 
(cf. p. 40) should be adopted. The examiner should 
always be on the alert to ensure that he is judging the 
scripts solely on the basis of the criteria which he has pre- 
viously formulated, that he is not being biased by bad 
handwriting or spelling, by some trick of style which jars 
on him, or by some personal predilection regarding the 
subject-matter. All marking which cannot be done 
objectively should consist largely in introspection : “ Why 
does this strike me as good or bad ? Are these the qualities 
which I am supposed to be judging, and are they the 
qualities which I took into account when I selected my 
standard specimens ? ” 

Many questions may advantageously be marked both
by the qualitative and by the objective (count) systems,
the figures then being combined in due proportion. If two
or more examiners mark the same, or even a few of the 
same, scripts, it will be particularly valuable to analyse the 
reasons for any discrepancies in marks, so as to clarify 
still further their criteria of good and bad. 
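
As a rough illustration of combining the two kinds of mark "in due proportion", the sketch below first reduces each set of marks to standard measures (deviation from the mean divided by the standard deviation), so that neither set dominates merely through its larger spread, and then adds them with chosen weights. The function names, the weight of 0.4, and the sample marks are assumptions made for the example; the text fixes no particular proportion.

```python
from statistics import mean, pstdev


def standardize(marks):
    """Express each mark as a deviation from the mean in S.D. units."""
    m, sd = mean(marks), pstdev(marks)
    if sd == 0:
        raise ValueError("all marks are equal; nothing to standardize")
    return [(x - m) / sd for x in marks]


def combine(quality_marks, count_marks, quality_weight=0.5):
    """Weighted sum of standardized quality and count marks.

    `quality_weight` is the assumed share given to the qualitative
    judgment; the remainder goes to the objective count.
    """
    if len(quality_marks) != len(count_marks):
        raise ValueError("both lists must cover the same scripts")
    zq = standardize(quality_marks)
    zc = standardize(count_marks)
    w = quality_weight
    return [w * q + (1 - w) * c for q, c in zip(zq, zc)]


# Hypothetical marks for four scripts.
quality = [3, 5, 6, 8]      # quality-scale categories
counts = [12, 20, 14, 26]   # points counted under a marking scheme
combined = combine(quality, counts, quality_weight=0.4)
print([round(v, 2) for v in combined])
```

Standardizing before weighting is one common way of making the stated weights the effective ones; adding raw marks would silently give more influence to whichever set happened to vary more.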

One other source of prejudice often arises in marking if 
the examiner is personally acquainted with some, or all, of 
the examinees. Either he should carefully avoid looking 
at the names on the papers, or, if he cannot help recognizing 
their writing or style, should do his best to disregard any 
previous impressions he may have formed about them.

6. With the statistical aspects of marking we have 
already dealt in Chaps. II-IV. But the following main
points in subjective marking (as distinct from count 
scoring) may be recalled. Marking is first and foremost 
the arranging of scripts in order of merit, or grouping them 
into categories of merit. It is not an attempt to give an 
absolute judgment of the value of each script. All the
answers to one question should be classified into such 
categories before starting on another question. The cate- 
gories should be, approximately, evenly spaced, and should 
not exceed ten in number. The frequency distribution of 
marks for each question should tend to the normal shape, 
and the average mark should be very close to 5 (assuming 
the middle category of merit to have been labelled 5). The
dispersion of the distribution should also be approximately 
constant for all questions, unless it is desired to weight some 
more heavily than others. The final total scores for the 
scripts may be converted into an appropriate distribution 
of percentage marks by any of the three methods given in 
Chap. IV. All decisions as to standards of passing and
failing should be postponed until the process of marking 
has been completed. 
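
The checks recommended here lend themselves to a simple routine. The sketch below assumes that each question has been marked in whole-number categories with 5 as the middle one; the tolerance used to judge whether the average is "very close to 5" is an arbitrary choice of ours, and neither the function name nor the sample marks come from the text.

```python
from statistics import mean, pstdev


def check_question_marks(marks, target_mean=5.0, tolerance=0.5,
                         max_categories=10):
    """Summarize one question's marks against the stated guidelines.

    `tolerance` (how far the average may stray from the middle
    category) is an assumed figure, chosen only for illustration.
    """
    avg = mean(marks)
    return {
        "mean": avg,
        "sd": pstdev(marks),
        "mean_ok": abs(avg - target_mean) <= tolerance,
        "categories_ok": len(set(marks)) <= max_categories,
    }


# Hypothetical marks for two questions, middle category labelled 5.
question_1 = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
question_2 = [6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9]

report_1 = check_question_marks(question_1)
report_2 = check_question_marks(question_2)
print(report_1)
print(report_2)
```

Comparing the `sd` figures of the several questions shows at a glance whether any one of them is being weighted more heavily than was intended.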

Finally, we would register a plea for the elimination, 
whenever feasible, of the harsh system of expressing 
examination results in only two, three, or four categories 
(Pass or Fail; First, Second, Third Class, etc.). Many more 
degrees of merit in the total marks can and should be recog- 
nized. If the system must be retained, then all those 
scripts which fall within, say, 5% above and below each 
borderline should be read a second time, preferably by 
more than one examiner, so as to ensure that the final 


decision shall attain the utmost reliability of which human 
fallibility is capable. 
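
The selection of scripts for a second reading can be sketched as follows. We interpret "5% above and below each borderline" as five percentage points on the mark scale, which is an assumption; the function name, the candidates, and the borderlines of 40 and 70 are likewise hypothetical.

```python
def borderline_scripts(percent_marks, borderlines, margin=5.0):
    """Return the candidates whose marks lie within `margin`
    percentage points of any borderline, with the borderlines
    concerned, so that their scripts can be read a second time.
    """
    flagged = {}
    for candidate, mark in percent_marks.items():
        near = [b for b in borderlines if abs(mark - b) <= margin]
        if near:
            flagged[candidate] = near
    return flagged


# Hypothetical percentage marks and two assumed borderlines.
marks = {"Adams": 72, "Brown": 58, "Clark": 41, "Davis": 63, "Evans": 49}
borders = [40, 70]
to_reread = borderline_scripts(marks, borders)
print(to_reread)
```

The closer together the borderlines lie, the larger the share of scripts that falls within the margin, which is a further practical argument against a small number of coarse categories.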


ANSWERS TO THE EXAMINATION ON pp. 240-6


1. 19.
2. 6.
3. 150.
4. 105 (104 or 106 may also count).
5. 80.
6. 21.
7. 82 (or 81).
8. 17 (or 15-16).
9. positive (or plus).
10. faculties.
11. Spearman.
12. g.
13. specific.
14. g.
15. group (or common).
16. F (or k).

(In Nos. 17-25, subtract one mark for each incorrect
answer. No guessing correction need be applied in
subsequent questions.)

+.
+.
-.
+.
-.
-.
-.
+.
C.
C.
None.
A.
B.
69 or 62.
-4 or -8.
C.
D.
B.

36. D.
37. B.
38. C.
39. A.
40. None.
41. D.
42. E.
43. D.
44. CF.
45. A.
46. CA.
47. EC.

(In Nos. 42-47 one point is scored by each answer in
which one or both letters are given correctly. Nothing is
scored by any answer which also includes an incorrect
letter.)

48. D.
49. B.
50. G.
51. A.
52. E.
53. E.
54. C.

(For scoring of Nos. 48-54, see p. 278.)
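
The two scoring rules stated in the notes to this key (rights minus wrongs for Nos. 17-25, and the all-or-nothing rule for the letter answers of Nos. 42-47) can be sketched as below. The function names and the sample key and answers are hypothetical, and treating an omitted item as scoring nothing is our own assumption.

```python
def score_true_false(key, answers):
    """Rights minus wrongs; omitted items (None) score nothing."""
    right = sum(1 for k, a in zip(key, answers)
                if a is not None and a == k)
    wrong = sum(1 for k, a in zip(key, answers)
                if a is not None and a != k)
    return right - wrong


def score_letter_item(correct_letters, given_letters):
    """One point if at least one correct letter is given and no
    incorrect letter is included; otherwise nothing."""
    given = set(given_letters)
    if not given:
        return 0
    return 1 if given <= set(correct_letters) else 0


# Hypothetical key and one candidate's answers to four items.
tf_key = ["+", "+", "-", "+"]
tf_answers = ["+", "-", "-", None]   # two right, one wrong, one omitted
tf_score = score_true_false(tf_key, tf_answers)

# An item whose correct answer consists of the letters C and F.
letter_scores = [score_letter_item("CF", given)
                 for given in ("C", "CF", "CA", "")]
print(tf_score, letter_scores)
```

Under the first rule a candidate who guesses blindly at every true-false item can expect, on the average, to gain nothing, since his rights and wrongs will tend to cancel.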




